Oracle Blog

All About InnoDB

Friday Jan 10, 2014

This is part of the ongoing work on improving transaction life
cycle management. In 5.7.2 we split the transaction list into two:
the read-only transaction list and the read-write transaction list.
There was also another, "virtual" list, the auto-commit non-locking
read-only (AC-NL-RO) transaction list. The change in 5.7.2 was that,
by default, a transaction was treated as read-only and added to the
read-only transaction list. Only when it was determined that the
transaction was going to do an update did we remove the transaction
from the read-only list and move it to the read-write transaction
list. This initial add to the read-only list forced the
acquisition of the trx_sys_t::mutex, and acquiring the mutex during
transaction start/begin has a cost. To promote a transaction from
read-only to read-write we also had to acquire the trx_sys_t::mutex
to add it to the read-write transaction list, but that is not too
expensive and is unavoidable. There is another transaction list for
caching user transactions, the mysql-trx-list, which is per
connection; we will ignore it in this discussion. All user
transactions, whether AC-NL-RO, plain read-only or read-write, are
on this list.

The optimization

The optimization in 5.7.3 is to eliminate the explicit read-only
transaction list altogether. By default, transactions are not put on
any list; only when a transaction is explicitly tagged as read-write
is it added to the read-write transaction list. This makes
transaction begin mutex-free. Additionally, if the transaction
commits as a read-only transaction, there is no list to remove it
from, so we do not need to acquire the trx_sys_t::mutex at commit
time either. This improves performance for read-only transactions,
making them more or less equivalent to AC-NL-RO transactions. They
will, however, incur some cost compared to AC-NL-RO transactions if
they acquire shared locks.

Additionally, the MVCC view open/close handling of read-only
transactions is now equivalent to that of AC-NL-RO transactions,
with very low life cycle overhead. This also helps improve
performance significantly.

User impact

There is of course the positive performance impact, but there is also
a change in visibility semantics. With the elimination of the
read-only list, read-only transactions are no longer visible
via SHOW ENGINE INNODB STATUS; they are, however, visible via the
INFORMATION_SCHEMA.

Thursday Dec 12, 2013

As you probably already know, in the MySQL
5.7.3 release, InnoDB Memcached reached a record of over 1 million
QPS on a read-only load. An overview of the benchmark and the test
results can be seen in an earlier blog by Dimitri.
In this blog, I will spend some time on the detailed changes we
have made to achieve this number.

First, thanks to Facebook's
Yoshinori, whose bug #70172 brought our attention to this single-commit read-only load test.
We had been focusing on operations with large batch sizes. This bug
prompted us to do a series of optimizations on single-commit read-only
queries, and these optimizations eliminate almost all major
bottlenecks in the InnoDB Memcached plugin itself.

If you are just getting familiar with
InnoDB Memcached, there are some earlier blogs on the topic to get you started. In short, InnoDB Memcached provides a fast path for retrieving key-value data stored in an InnoDB table, using the Memcached protocol.

The Benchmark:

Now let's discuss the testing scenario. The InnoDB Memcached plugin
configuration is all defaults in this benchmark, which means
daemon_memcached_r_batch_size was set to 1, so each read
operation does a begin and commit transaction for each query. It
is equivalent to auto-commit single selects through the SQL interface.
The innodb_api_trx_level is by default set to 0 (READ UNCOMMITTED);
however, changing it to 2 (REPEATABLE READ) gave the same benchmark
result.
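
For reference, the relevant settings (all at their defaults here,
shown explicitly only for clarity) would look like this in my.cnf:

[mysqld]
daemon_memcached_r_batch_size=1    # begin/commit per read request
innodb_api_trx_level=0             # 0 = READ UNCOMMITTED; 2 gave the same result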

Another piece of good news in 5.7.3 is that we started to support
integer key column mapping, as it is common to use an integer primary
key for a table. The table used in this benchmark comes with an
integer key column. The mapping table contains only key and
value columns, so we set the corresponding `flags`, `cas_column` and
`expire_time_column` columns in the config containers table all to
NULL; this avoids the overhead of the columns supporting the Memcached
"extra" options. The table itself contains 1 million rows, each with
a short integer key and a text value.

Here is the detailed table definition:

mysql> desc test.memc_test;
+--------+---------+------+-----+---------+-------+
| Field  | Type    | Null | Key | Default | Extra |
+--------+---------+------+-----+---------+-------+
| id     | int(11) | NO   | PRI | NULL    |       |
| value3 | text    | YES  |     | NULL    |       |
+--------+---------+------+-----+---------+-------+

To make InnoDB Memcached recognize this InnoDB table, insert the following row into the innodb_memcache.containers table:
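
The original post showed the row here; a plausible reconstruction
based on the description above (the container name 'default' is
illustrative):

INSERT INTO innodb_memcache.containers
  (name, db_schema, db_table, key_columns, value_columns,
   flags, cas_column, expire_time_column, unique_idx_name_on_key)
VALUES ('default', 'test', 'memc_test', 'id', 'value3',
        NULL, NULL, NULL, 'PRIMARY');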

The memcached client used for inserting
rows and querying is a simple libmemcached program provided by
Yoshinori. Each query does a key lookup and fetches the corresponding value.
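
Such a client boils down to a loop of key lookups; here is a minimal
sketch using the libmemcached C API (the server address and the key
naming are assumptions):

#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "127.0.0.1", 11211);

    size_t value_len;
    uint32_t flags;
    memcached_return_t rc;

    /* Each get is one key lookup, i.e. one auto-commit read in InnoDB. */
    char *value = memcached_get(memc, "1", 1, &value_len, &flags, &rc);
    if (rc == MEMCACHED_SUCCESS) {
        printf("%.*s\n", (int) value_len, value);
        free(value);
    }

    memcached_free(memc);
    return 0;
}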

We made some
adjustments so that there are multiple client processes, each with
multiple sessions. This was done to alleviate bottlenecks in the
client itself.

As a note, there are many Memcached
clients out there, and the client can play an important role
in the performance result itself. For example, we observed at least
a 3x difference in results between the Perl client
Cache::Memcached::Fast and the slower Cache::Memcached. And as far as
we can see, libmemcached is one of the most efficient clients
available, even though it eventually becomes a bottleneck itself as
the test progresses, especially for requests over the network.

The
test results can be seen in Dimitri's blog, so I will not repeat them
here. The summary is that we got close to 1.2 million QPS at
best. The next bottleneck now seems to lie in the adaptive hash
index's global latch,
"btr_search_latch". The libmemcached client overhead is also
significant.

[Chart: Read Only QPS through InnoDB Memcached]

The improvement:

There
are several changes in both the InnoDB Memcached code and the native
Memcached code that achieve the record benchmarks.

1. The first
is of course to address the issue raised by bug #70172. With
daemon_memcached_r_batch_size set to 1, a transaction is
started and committed for each query. It is better to
cache the trx object itself, to avoid repeatedly creating and
destroying the trx object. Otherwise, the "trx_sys mutex" will kill
the concurrency.

After the change,
the trx object is cached with private memcached connection data. Each
connection gets its own trx object, and it is used to handle
transactions through this particular connection.
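
The pattern is roughly the following (a sketch; the type and helper
names are illustrative, not the exact plugin source):

struct trx_t { bool started = false; };

static trx_t* trx_allocate() { return new trx_t(); }                  // stub
static void trx_start_if_not_started(trx_t* t) { t->started = true; } // stub

// Per-connection private data holds one cached trx for its lifetime.
struct innodb_conn_data {
    trx_t* crsr_trx = nullptr;
};

trx_t* conn_get_trx(innodb_conn_data* conn) {
    if (conn->crsr_trx == nullptr) {
        conn->crsr_trx = trx_allocate();       // pay the allocation cost once
    }
    trx_start_if_not_started(conn->crsr_trx);  // reuse across queries
    return conn->crsr_trx;
}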

2. The next
thing we did was to take advantage of the read-only optimization
recently made in the InnoDB code. This scenario (a single read per
transaction) is perfect for the optimization. Whenever the read batch
size is set to 1, InnoDB Memcached will treat incoming queries as
auto-commit read-only queries, automatically hooking into the "fast
path" of read-only operation in InnoDB.

3. After these
two transaction-related changes, we found the bottleneck moved to the
native Memcached code itself. As a note, we embed the Memcached code in our InnoDB Memcached plugin, so any bottleneck in Memcached affects us.

The original Memcached memory allocation is
protected by a global cache lock (engine->cache_lock), and it
quickly rose to prominence in the profiling results.

Even though the data is stored in InnoDB, we still used some of
Memcached's own memory allocation to store and deliver the
results back to the front end. To fix this bottleneck, we stopped
using Memcached memory altogether. Instead, a connection-private
memory buffer is used to store and deliver the result. This also
saves a memcpy, since we no longer move the data into Memcached
memory as before.

This change makes the InnoDB Memcached plugin as thin as
possible; it relies only on the InnoDB buffer pool and the Adaptive
Hash Index (AHI) as the backing store for the data. This provides
better scaling and memory handling than Memcached itself.

4. Another
bottleneck in Memcached is its statistics mutex
("thread_stats->mutex"). This also becomes significant as the test
progresses. To remove it, we switched to using atomic operations
whenever the platform supports them (most modern platforms do).
With these changes, the plugin now scales well to over 100
connections without degradation as the number of connections is
ramped up.
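
The change follows this pattern (a sketch using GCC-style atomic
builtins; the stats structure and field names are illustrative):

#include <pthread.h>
#include <stdint.h>

struct thread_stats_t {
    pthread_mutex_t mutex;
    uint64_t        get_cmds;
};

void stats_count_get(thread_stats_t* stats, bool have_atomics) {
    if (have_atomics) {
        /* Lock-free increment where the platform supports it. */
        __sync_fetch_and_add(&stats->get_cmds, 1);
    } else {
        /* Fallback: the original mutex-protected counter update. */
        pthread_mutex_lock(&stats->mutex);
        stats->get_cmds++;
        pthread_mutex_unlock(&stats->mutex);
    }
}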

5. In addition to removing those major bottlenecks,
we also streamlined the code to remove some overhead work. For
example, we now cache the "search tuple", so that
there is no need to allocate a search tuple for each query. This
keeps InnoDB Memcached as lean as possible.

With these
changes, we have eliminated all the major InnoDB Memcached plugin
bottlenecks. The remaining bottlenecks come from the clients
themselves and, to a lesser degree, from the Adaptive Hash Index
search latch.

Future work:

Memcached reads now go more than twice as fast
as the equivalent reads through the SQL interface. By using the
InnoDB buffer pool as the in-memory store, and with the InnoDB AHI,
InnoDB Memcached can probably provide a more efficient and more
scalable store than Memcached itself.

There
is still more to be done.

1. We will continue to remove some
bottlenecks in InnoDB (such as btr_search_latch), as well as make
InnoDB memcached leaner/faster.

2. We will add support for the
"mgets" command, which allows Memcached to fetch multiple
results (corresponding to multiple keys) in one query attempt. This
should give us another big jump in terms of QPS.

3. We will start to focus more on insert/update operations.

4.
We are considering extending the functionality of the Memcached
interface to support range queries etc., to make it a more versatile
key-value store.

In summary, with these enhancements,
InnoDB Memcached becomes more and more attractive as a quick
key-value store through the MySQL server.

Your feedback and
comments are important to us as we evolve and improve this plugin.

What does this mean for InnoDB?

REDO logging can be avoided for temporary tables and related objects since temporary tables do not survive a shutdown or crash.

Temporary table definitions can be maintained in-memory without persisting to the disk.

Locking constraints can be relaxed since only one client can see these tables.

Change buffering can be avoided since the majority of temporary tables are short-lived.

In order to implement these changes in InnoDB, we took a somewhat different approach:

We introduced a dedicated tablespace for housing temporary tables. In 5.7, all non-compressed temporary tables (irrespective of innodb-file-per-table setting) are housed in this new temporary tablespace (ibtmp1). Compressed temporary tables continue to get housed in their independent tablespace. This new temporary tablespace is re-created on each server restart and can be pre-extended before starting server. (For more about temporary tablespace check MySQL documentation).
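
For example, the temporary tablespace can be pre-extended (and
capped) at server startup with the innodb_temp_data_file_path option;
the sizes below are illustrative:

[mysqld]
innodb_temp_data_file_path=ibtmp1:128M:autoextend:max:5G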

All related objects including rollback-segments have been moved to this new temporary tablespace. This helps in maintaining locality and importantly eliminating REDO logging associated with all such objects.

Friday Sep 13, 2013

Introduction

This article describes the InnoDB redundant row format. If you are
new to InnoDB code base (a new developer starting to work with
InnoDB), then this article is for you. I'll explain the row format by
making use of a gdb session. An overview of the article is given
below:

Create a simple table and populate a few rows.

Access the page that contains the rows inserted.

Access a couple of rows and explain their format.

Give a summary of the redundant row format.

Useful gdb commands to analyse the InnoDB rows.

Look at a GNU Emacs Lisp function to traverse rows in
an InnoDB index page.

To get the most out of this article, the reader is expected to
repeat the gdb session as described here.

The Schema
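
The original post showed the schema here; reconstructed from the
discussion that follows (one unsigned integer column, five rows):

CREATE TABLE t1 (f1 INT UNSIGNED) row_format=redundant engine=innodb;
INSERT INTO t1 VALUES (1), (2), (3), (4), (5);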

When the above CREATE TABLE is executed, InnoDB internally creates
a clustered B-tree index (or just clustered index) representing the
table. If no primary key is explicitly given, a primary
key is internally added. Each node of the B-tree is represented by
one page (which by default is of size 16KB). In the given scenario,
we insert only 5 records, so the B-tree will consist of only 1 node,
the root node.

Access InnoDB Index Page in gdb

All select queries go through the row_search_for_mysql()
function of InnoDB. The first time it is called for a query, a
persistent cursor (btr_pcur_t) is opened by this function. When
this B-tree persistent cursor (pcur in the following gdb) points to
the InnoDB index page containing the rows of our interest, we can save
the page frame to a file as follows:
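
The original post showed the gdb commands here; a plausible
reconstruction (the member chain follows the 5.7-era data structures,
and the file name is arbitrary):

(gdb) set $frame = pcur->btr_cur.page_cur.block->frame
(gdb) dump binary memory /tmp/page.dump $frame $frame + 16384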

If you do not understand the commands above, refer to the gdb
manual for more information. Saving a page in binary format to a file
on disk helps us analyse the contents of the page by loading it into
a text editor like GNU Emacs or ViM. This is sometimes a very useful
approach for debugging. The above commands were executed while
stopped at the sync point row_search_rec_loop, in the function
row_search_for_mysql(). The line numbers are from MySQL 5.7.3.

An important point to remember when analysing the page is that all
the data in the page are stored in big-endian format.

Open Page in GNU Emacs

The following image shows the InnoDB page opened in GNU Emacs in
hexl-mode major mode. I have placed the cursor on the origin of the
infimum record (at offset 0x65 from page origin). In the right side,
you can look at the text "infimum" and "supremum" as stored in the
page. These are special records.

All rows in an InnoDB page are linked to one another in ascending
order via the next record pointer. In the given example this ordering
is based on the internally generated DB_ROW_ID. In each page, the
first record (or the minimum record) is the infimum record, and the
last record (or the maximum record) is the supremum record. The
records between the infimum and supremum records are called the user
records.

The next record pointer of each record points to the "origin" of
the next record. The record contains data to the right of the origin
and meta data to the left of the origin. The next record pointer is
available in the 2 bytes before the record origin.

The next record pointer of infimum as given in the picture above
is 0x0087 (which is 135 in decimal). This is the offset of the next
record from the beginning of the page.

Printing the infimum record

As mentioned previously, the infimum is a special minimum record in an
index page. It exists at a fixed offset of 101 bytes from the page
beginning. The function page_get_infimum_rec() can be used to get the
infimum record given a page frame pointer. All records in InnoDB are
pointed to at their record origin, so to access the meta data in the
record, we need to subtract offsets from the record origin.
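
The original post showed a gdb dump of those bytes here. A
reconstruction, with the byte values derived from the bit-field table
below (the variable name rec and the address are placeholders):

(gdb) x/6bx rec - 6
0x...:  0x01  0x00  0x00  0x03  0x00  0x87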

The above 6 bytes of bit fields of the record are formatted as
follows (the table also contains the actual values as seen above):

Bit field            Size      Byte(s)     Value
info bits            4 bits    byte 6      0000
n_owned              4 bits    byte 6      0001
heap number          13 bits   bytes 4,5   00000000 00000
field count          10 bits   bytes 3,4   000 0000001
flag                 1 bit     byte 3      1
next record offset   2 bytes   bytes 1,2   00000000 10000111

Each row above represents one bit field. It lists the
following information:

The name and a short description of the bit field.
The size of the bit field in bits or bytes.
The byte(s) where this bit field exists.
The actual value of this bit field as seen in the gdb output above.

The 1 bit flag above is used to tell whether
the offset array element is 1 byte (if set) or 2 bytes (if not set).
In our example case, the 1 bit flag is set and hence the offset array
elements are each 1 byte long.

From the field count bit field, we can see that the infimum record
has only one field, whose value is fixed to "infimum". The next record
pointer of the infimum record points to the first user record in the
page. This is available at the offset 0x0087, which is 135 in
decimal. So the first user record is available at page frame pointer
+ 135 bytes.

The Bit Fields

The information available in 6 bytes (REC_N_OLD_EXTRA_BYTES) of
bit field data is listed here. These six bytes are available
immediately before the record origin. These bit fields must be
accessed by making use of the interface provided by the record manager
(refer to rem0rec.h).

Bit field            Size      Description of the bit field

info bits            4 bits    Used to delete-mark a record.

n_owned              4 bits    Number of records owned by this record.
                               This is related to the page directory. To
                               understand this field, we need to understand
                               the page structure used by InnoDB. This is
                               not explained further in this article.

heap number          13 bits   Gives the order number of this record in the
                               page. Each row has a unique heap number in
                               the page. The lock manager uses this heap
                               number.

field count          10 bits   Number of fields in this record.

flag                 1 bit     Tells if the offset array element is 1 or
                               2 bytes.

next record offset   2 bytes   Offset of the next record from the page
                               origin.

Printing the First User Record

To parse the record, we first need to know the number of fields in
the record. This can be obtained from the 6 bytes of bit-field
information available before the record origin. Let us first see this
for the user record:

The above information shows that this record
has 4 fields (000 0000100 in binary). But our table definition was

CREATE TABLE t1 (f1 INT UNSIGNED) row_format=redundant engine=innodb;

If you are wondering why a row of this table has a field count of 4,
it is because InnoDB added 3 internal fields to the table t1 - a row
identifier (DB_ROW_ID), a transaction identifier (DB_TRX_ID) and an
undo roll pointer (DB_ROLL_PTR). A row identifier column is internally
added to a table if no primary key is explicitly given by the user;
this row identifier becomes the internal primary key. The transaction
identifier specifies the transaction that last updated the row.
The undo roll pointer points to the previous versions of the record.
The 4th field is the actual column f1 added by the user.

The list of fields in the table t1 can be printed by accessing its
associated dict_table_t object (or its clustered index dict_index_t
object). In the following, the name GEN_CLUST_INDEX is the name of
the clustered index of the table t1. This is the standard name of the
clustered index if it was internally generated by InnoDB.

Printing the Record Offsets (or Offset Array)

The redundant record contains an array of offsets of each field of
the row. This array is available after the 6-byte bit fields (down
from the record origin). The size of each element of the array is
either 1 byte or 2 bytes, depending on the 1-bit flag in the 6-byte
bit fields. In our current case each offset element is 1 byte long.

The first field (DB_ROW_ID) starts at offset 0 from record origin.
The second field (DB_TRX_ID) starts at offset 6 from record origin.
The third field (DB_ROLL_PTR) starts at offset 12 bytes from record
origin. The fourth field (f1) starts at offset 19 bytes from record
origin. The end of the record is 23 bytes from the record origin. So
the total size of this record can be calculated as follows:
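
The arithmetic, using the offsets just listed:

     6 bytes   DB_ROW_ID    (offset  0 to  6)
     6 bytes   DB_TRX_ID    (offset  6 to 12)
     7 bytes   DB_ROLL_PTR  (offset 12 to 19)
     4 bytes   f1           (offset 19 to 23)
  = 23 bytes of data
  +  6 bytes of bit fields (REC_N_OLD_EXTRA_BYTES)
  +  4 bytes of offset array (4 fields x 1 byte each)
  = 33 bytes total record size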

Thus, we have completely accessed the first user record. Given
the first user record, we can find out the location of the next user
record (the first 2 bytes before the record origin). If we keep
accessing the next user record, we will eventually reach the supremum
record, which is the last (and considered largest or maximum) record
in a page.

Printing All Rows

Just to improve our confidence in accessing the rows, here I have
printed all the rows that we have inserted into the table. The first
record is the infimum, followed by the records 1, 2, 3, 4 and 5 (which
we have inserted) and then the last record supremum.

The Complete Redundant Row Format

One redundant record can be logically separated into 5 parts - the
offsets array, the 6-byte bit field, record origin, hidden fields
(DB_ROW_ID, DB_TRX_ID, DB_ROLL_PTR) and the actual user-specified
fields of the table. They can be viewed as follows:

offsets array      fn ... f1, x2, x1, pn ... p1
6-byte bit field
record origin
primary key        p1, p2 ... pn    (DB_ROW_ID or user given)
hidden fields      x1, x2           (DB_TRX_ID, DB_ROLL_PTR)
user fields        f1, f2, f3 ... fn

All record pointers point to the record
origin. The record origin is the beginning of the primary key for the
clustered primary index. It is not a separate entity that occupies
some space. I am showing it separately to make it easy to
visualize. All information prior to the record origin can be
considered as the record header or the record meta-data.

Each data field has a corresponding offsets array element. The
offsets array is arranged in reverse order. The first element in the
array provides the offset of the end of the last data field and the
last element in the array provides the offset of the end of the first
data element.

The offsets array also tells whether a data field is NULL. If a
field is NULL, the highest bit in the associated offset array element
will be set to 1.

The biggest advantage of the redundant row format is that a record
in this format can be parsed without needing to access the data
dictionary. But this format is not space efficient. Even for fixed
length fields, we store information in the offsets array (this is
avoided in COMPACT, DYNAMIC and COMPRESSED row formats).

One more limitation of the redundant row format is that the blob
data cannot be stored completely off page (as in DYNAMIC and
COMPRESSED row formats). A 768 byte prefix of the blob will always be
stored inline with the row and cannot be avoided.

A small note on row formats: InnoDB really supports only two row
formats - the redundant row format and the compact row format. The
compressed and dynamic row formats are variations available only on
top of the compact row format. Technically these variations could be
implemented for the redundant row format as well, but that is
currently not available.

GNU Emacs Function to Traverse Rows in Page

If we dump a single InnoDB index page into a file, we can open it
in GNU Emacs hexl-mode and use the following elisp function to
traverse from infimum record to the supremum record. First place the
cursor on the 'i' of the infimum record and then call this function to
reach the next record origin. I am not an expert in writing elisp
functions. If you can improve on this and write a better one, kindly
share it with me.

(defun ib-next-red-row ()
  "Go to the origin of the next record in ROW_FORMAT=REDUNDANT.
The cursor must be positioned at the origin of a record.
This is for the GNU Emacs hexl-mode."
  (interactive)
  (setq rec-origin (point))
  (setq rec-origin-offset (hexl-current-address))
  ;; The next-rec pointer is stored in 2 bytes before the
  ;; record origin
  (hexl-backward-char 2)
  (setq next-rec (string (char-after)))
  (forward-char 1)
  (setq next-rec (concat next-rec (string (char-after))))
  (hexl-forward-char 1)
  (setq next-rec (concat next-rec (string (char-after))))
  (forward-char 1)
  (setq next-rec (concat next-rec (string (char-after))))
  ;; Next record is next-rec-off bytes away from page origin
  (setq next-rec-off (hexl-hex-string-to-integer next-rec))
  ;; Assuming that there is only one page in dump
  (hexl-goto-address 0)
  (hexl-forward-char next-rec-off))

Conclusion

In this article, the format of the redundant row was explained. A
gdb session was used to demonstrate how to access a page and the rows
within that page. The infimum record was accessed, printed and
analysed. Using its next record pointer, the first user record was
accessed. The reader can use the next record pointer of the first
user record to access the next user record and so on till the supremum
record is reached. I leave that as an exercise for the readers.

Thanks to Marko Makela for his useful tips on debugging at our yearly
InnoDB Team Meeting 2013 held at Shanghai, China. This article
derives from his ideas.

Monday Jul 15, 2013

Introduction

InnoDB is a
general-purpose storage
engine that balances high reliability and high performance. It is
a transactional storage engine and is fully ACID compliant, as would
be expected from any relational database. The durability guarantee
provided by InnoDB is made possible by the redo logs.

This article will provide an overview of the redo log subsystem or
log subsystem of InnoDB. We will look at the following details:

The global log system object, which provides access to important
data structures and information.

The mini-transaction (mtr), using which all redo log records
are created.

The global in-memory log buffer (or just log buffer), into which
the redo logs are written from the mini transaction buffer.
This log buffer is periodically flushed to the log file on
disk.

The redo log files on the disk and their high level internal
structure.

We will also discuss the concept of LSN, and how the various
values of LSN are used to implement write-ahead logging (WAL).

Redo Log Generation

In the article, Data Organization in InnoDB,
we already saw that the user data files of InnoDB (.ibd files) are
considered to be a sequence of equal sized pages. These pages are
uniquely identified within the InnoDB system using the space_id,
page_no combination. If we want to read or modify any page, it needs
to be loaded into memory. So there are two copies of the pages - on
disk and in memory. Here are the high level steps in redo log
generation.

Any changes to a page are first made to the in-memory copy of the
page. A page that has been modified in memory but not yet flushed to
disk is marked as a dirty page in memory.

An associated redo log is generated in memory, in the local mini
transaction (mtr) buffer. This will then be transferred to the global
in-memory redo log buffer.

The redo log record is written from the redo log buffer in memory
to the redo log file on disk, which is subsequently flushed. Writing
to the redo log file and flushing the redo log file to disk are
considered two separate steps, to account for the file buffering done
by the operating system.

The dirty page is flushed from memory to disk at some later point
in time as part of the checkpointing operation.

The order of these steps is important. The redo log record of a
change must be flushed to disk before the corresponding dirty
page is flushed to disk. This is the concept
of write-ahead
logging (WAL).

The generated redo log record will contain information necessary
to repeat the same operation later during a database recovery. So the
redo log record will contain information about the original set of
pages and what has been changed in them. Using the redo log record
the set of dirty pages will be re-created during database
recovery.

Redo Log Files

By default, InnoDB creates two redo log files (or just log
files), ib_logfile0 and ib_logfile1, within the data
directory of MySQL. In MySQL versions 5.6.8 and above, the default
size of each redo log file is 48MB. This can be configured by
the user via the innodb_log_file_size server option. The number of
log files is controlled by the innodb_log_files_in_group
server option.

A log group consists of a number of log files, each of same size.
As of MySQL 5.6, InnoDB supports only one log group. So I'll not
discuss this further.

The redo log files are used in a circular fashion. This means
that redo logs are written from the beginning to the end of the first
redo log file, writing then continues into the next log file,
and so on until the last redo log file. Once the last redo
log file has been written, redo logs are again written from the
first redo log file.
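
As a rough illustration of the circular layout (a hedged sketch; the
real mapping in the InnoDB source also accounts for the per-file
headers and block framing):

#include <cstdint>

typedef uint64_t lsn_t;

// Map an LSN to (file number, offset within file) in a circular group
// of equally sized log files. Simplified: ignores the file headers.
void lsn_to_file_offset(lsn_t lsn, uint64_t file_size, unsigned n_files,
                        unsigned* file_no, uint64_t* offset) {
    uint64_t capacity = file_size * n_files;  // total circular space
    uint64_t pos = lsn % capacity;            // wrap around the group
    *file_no = (unsigned) (pos / file_size);
    *offset  = pos % file_size;
}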

The log files are viewed as a sequence of blocks called "log
blocks" whose size is given by OS_FILE_LOG_BLOCK_SIZE which is equal
to 512 bytes. Each log file has a header whose size is given by
LOG_FILE_HDR_SIZE, which is defined as 4*OS_FILE_LOG_BLOCK_SIZE.

Redo Log File Header

Each redo log file contains a header occupying four log blocks
with the following information:

The first 4 bytes contain the log group number to which the log
file belongs.

The next 8 bytes contain the lsn of the start of data in this log
file.

First checkpoint field located in the beginning of the second log
block.

Second checkpoint field located in the beginning of the fourth
log block.

The checkpoint field mentioned above contains information
necessary to identify a checkpoint - a checkpoint number, the LSN of
the checkpoint, checksum information and more.

Log Blocks

A redo log file can be viewed as a sequence of log blocks. All
log blocks, except the ones belonging to the log file header, contain
a header and a footer. The header is of size LOG_BLOCK_HDR_SIZE bytes
(which is defined as 12 bytes). The log block trailer is of size
LOG_BLOCK_TRL_SIZE (4 bytes). The log block header contains the
following information:

The log block number. This field is of 4 bytes.

Number of bytes of log written to this block. This field is of
2 bytes.

Offset of the first start of an mtr log record group in this
log block or 0 if none

The checkpoint number to which this log block belongs

The log block trailer contains checksum of the log block
contents.

Log Sequence Number (LSN)

The log sequence number (LSN) is an important concept. The LSN is
an offset into the redo log file. Within InnoDB the log sequence
number is represented by the type lsn_t, which is an 8-byte unsigned
integer. There are different LSN values that are of interest. The
following table lists some of the LSN values discussed in this
article. (Note: The log_sys is the global log system object
which is explained in next section.)

LSN                             Description

log_sys->lsn                    The redo log record that is generated
                                next will make use of this lsn.

log_sys->flushed_to_disk_lsn    The redo log file has been flushed up
                                to this LSN. All redo log records
                                whose LSN is less than
                                flushed_to_disk_lsn are safely on the
                                disk.

log_sys->write_lsn              There is a currently running write
                                operation which will write up to this
                                LSN.

log_sys->current_flush_lsn      There is a currently running write +
                                flush operation which will write up to
                                this LSN.

The LSN is what links the dirty page, the redo log record,
and the redo log file. When a redo log record is copied to the
in-memory log buffer, it gets an associated LSN. When a database
page is modified, redo log records are generated, so each database
page is also associated with an LSN. The page lsn is stored in a
header field of each page. The page lsn gives the lsn up to which the
redo log file must be flushed before flushing the page.
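
A hedged sketch of that rule as code (log_write_up_to() is a real
InnoDB function, though its exact signature varies across versions;
everything else below is an illustrative stub):

typedef unsigned long long lsn_t;

struct buf_page_t { lsn_t newest_modification; /* the page lsn */ };

static lsn_t flushed_to_disk_lsn = 0;   // log_sys->flushed_to_disk_lsn

static void log_write_up_to(lsn_t lsn, bool flush) {
    // Stub: in InnoDB this writes and optionally fsyncs the redo log
    // up to the given lsn.
    flushed_to_disk_lsn = lsn;
    (void) flush;
}

static void write_page_to_datafile(buf_page_t*) { /* stub */ }

// WAL: never write a dirty page before its redo is durable.
void flush_dirty_page(buf_page_t* page) {
    if (flushed_to_disk_lsn < page->newest_modification) {
        log_write_up_to(page->newest_modification, true);
    }
    write_page_to_datafile(page);
}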

Global Log System Object

The global log system object log_sys (of type log_t)
holds important information related to the log subsystem of InnoDB.
Here only the information related to the redo log buffer and the redo
log files is discussed. The global log_sys identifies the "active
area" of the log buffer, which contains redo logs that are yet to be
safely written to the disk. It also identifies the area in the redo
log file into which the active area of the log buffer will be
written/flushed.

/** Redo log buffer */
struct log_t {
    lsn_t   lsn;            /*!< log sequence number */
    ulint   buf_free;       /*!< first free offset within the log
                            buffer */
    byte*   buf_ptr;        /* unaligned log buffer */
    byte*   buf;            /*!< log buffer */
    ulint   buf_size;       /*!< log buffer size in bytes */
    ulint   max_buf_free;   /*!< recommended maximum value of
                            buf_free, after which the buffer is
                            flushed */
    ulint   buf_next_to_write;
                            /*!< first offset in the log buffer
                            where the byte content may not exist
                            written to file, e.g., the start
                            offset of a log record catenated
                            later; this is advanced when a flush
                            operation is completed to all the log
                            groups */
    lsn_t   write_lsn;      /*!< end lsn for the current running
                            write */
    ulint   write_end_offset;
                            /*!< the data in buffer has
                            been written up to this offset
                            when the current write ends:
                            this field will then be copied
                            to buf_next_to_write */
    lsn_t   current_flush_lsn;
                            /*!< end lsn for the current running
                            write + flush operation */
    lsn_t   flushed_to_disk_lsn;
                            /*!< how far we have written the log
                            AND flushed to disk */
};

The member log_sys->buf points to the in-memory redo
log buffer. This is the buffer into which an mtr_commit() writes
the redo log records. The size of this buffer is given
by log_sys->buf_size in bytes.

The member log_sys->buf_free is the offset within the
in-memory redo log buffer, where the next redo log record will be
written. This is the end offset for the next redo log flush to
disk.

The member log_sys->buf_next_to_write is the offset of the first
byte in the log buffer that has not yet been written to the redo log
file. When the redo log buffer is flushed to disk next time, it
will flush from this location. This is the start offset for the next
redo log flush to disk.

The member log_sys->flushed_to_disk_lsn marks the lsn
up to which we have written to the log file on the disk and flushed
it. So up to this lsn, the redo logs are safely on the disk. This
will always be less than or equal to log_sys->write_lsn
and log_sys->lsn.

The member log_sys->lsn represents the current lsn.
This member will be updated whenever we do mtr_commit(). The
function mtr_commit() writes the bunch of redo log records generated
in the mini transaction to the global or system wide redo log buffer
in memory. This will always be greater than or equal
to log_sys->flushed_to_disk_lsn
and log_sys->write_lsn. This will be the LSN of the redo
log record written at log_sys->buf_free.

The member log_sys->write_lsn represents the end lsn
of a currently running redo log buffer write operation. This will
be greater than or equal to log_sys->flushed_to_disk_lsn
and less than or equal to log_sys->lsn.

The member log_sys->current_flush_lsn represents the
end lsn of a currently running redo log buffer write + flush
operation. This will be mostly equal
to log_sys->write_lsn.

The global log_sys object points to various positions in
the in-memory redo log buffer and on-disk redo log files. The
following picture shows the locations pointed to by the
global log_sys object. The picture clearly shows that the
redo log buffer maps to a specific portion of the redo log file.

Global In-memory Redo Log Buffer

The in-memory redo log buffer is global and all redo logs
generated by user transactions will be written into this buffer. The
size of this buffer is configurable and is given by
the innodb_log_buffer_size.
The default size of this redo log buffer is 8MB.

When a transaction is running and if it is modifying the database,
then it will generate redo logs and populate this log buffer. This
log buffer will be written or flushed to the log file, either when the
transaction commits, or when the log buffer gets full.

When the redo log buffer is full, and there is not enough space
for an mtr_commit(), which will transfer a group of redo log
records to log buffer, then a synchronous flush of the log buffer
is done to the redo log file by using the function
log_buffer_flush_to_disk(). This function will write the log buffer
from log_sys->buf_next_to_write
to log_sys->buf_free. In terms of LSN, the function
log_buffer_flush_to_disk() flushes the redo logs
from log_sys->flushed_to_disk_lsn
to log_sys->lsn.

The mini transaction (mtr)

A mini transaction (mtr) must be used to generate all redo log
records. A mini transaction contains a local buffer (called the mini
transaction buffer) into which the generated redo log records are
stored. If we need to generate a group of redo log records such that
either all make it to the redo log file or none does, then we need
to put them in a single mini transaction. Apart from the redo log
records, the mini transaction also maintains a list of pages that
have been modified (dirty pages).

The normal usage of a mini transaction is as follows:

Create a mini transaction object of type mtr_t

Start the mini transaction with mtr_start() function. This
will initialize the mini transaction buffer.

Generate the redo log records by making use of the
mlog_write_ulint() family of functions.

Commit the mini transaction with mtr_commit()
function. This will transfer the redo log records from mini
transaction buffer to the global redo log buffer. The list of
dirty pages is added to the buffer pool flush list.
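
In code, the pattern looks roughly like this (a sketch; mtr_start(),
mlog_write_ulint() and mtr_commit() are the real function names, but
the surrounding context and the page pointer are illustrative):

mtr_t   mtr;

mtr_start(&mtr);        /* initialize the mini transaction buffer */

/* Modify a page and log the change; MLOG_4BYTES logs a 4-byte
write at ptr, which must point into a buffer pool page. */
mlog_write_ulint(ptr, new_value, MLOG_4BYTES, &mtr);

mtr_commit(&mtr);       /* copy the mtr log to the global log buffer
                        and add the dirtied pages to the flush list */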

The definition of a mini transaction is given here for your
reference. The member mtr_t::log contains the mini transaction buffer
which will hold the redo log records, and the member mtr_t::memo
contains a list of pages dirtied by this mini transaction.

/* Mini-transaction handle and buffer */
struct mtr_t {
    dyn_array_t memo;           /*!< memo stack for locks etc. */
    dyn_array_t log;            /*!< mini-transaction log */
    unsigned    inside_ibuf:1;
                                /*!< TRUE if inside ibuf changes */
    unsigned    modifications:1;
                                /*!< TRUE if the mini-transaction
                                modified buffer pool pages */
    unsigned    made_dirty:1;
                                /*!< TRUE if mtr has made at least
                                one buffer pool page dirty */
    ulint       n_log_recs;     /* count of how many page initial log
                                records have been written to the mtr
                                log */
    ulint       n_freed_pages;  /* number of pages that have been
                                freed in this mini-transaction */
    ulint       log_mode;       /* specifies which operations should
                                be logged; default value MTR_LOG_ALL */
    lsn_t       start_lsn;      /* start lsn of the possible log entry
                                for this mtr */
    lsn_t       end_lsn;        /* end lsn of the possible log entry
                                for this mtr */
};

Redo Log Record Types

When we modify a database page, a redo log record is generated.
This redo log record either contains what information has been changed
in the page (physical redo log) or how to perform the change again
(logical redo log). InnoDB uses a combination of physical redo logs
and logical redo logs.

To understand this consider an operation like page
re-organization. If we generated a physical redo log record for this
operation, it is likely that the redo log record generated will be
equal to the page size. But instead InnoDB uses logical redo log
record for this operation, which will reduce the redo log record size
significantly. In the case of logical redo log record to represent a
page reorganize, all we need is the information to uniquely identify a
page, and "type" of operation which is page reorganize.

So each redo log record has a type. A redo log record type helps
to identify the function that will be used to apply or execute the
redo log during recovery. The contents of the redo log record must
then contain all the arguments or parameters needed by the
function.

Life cycle of a redo log record

The life cycle of a redo log record is as follows:

The redo log record is first created by a mini transaction and
stored in the mini transaction buffer. It has the information
necessary to redo the same operation again at the time of database
recovery.

When mtr_commit() is done, the redo log record is transferred to
the global in-memory redo log buffer. If necessary, the redo log
buffer will be flushed to the redo log file, to make space for the new
redo log record.

The redo log record has a specific lsn associated with it. This
association is established during mtr_commit(), when the redo log
record is transferred from mtr buffer to log buffer. Once the lsn is
established for a redo log record, then its location in the redo log
file is also established.

The redo log record is then transferred from the log buffer to
redo log file when it is written + flushed. This means that the redo
log record is now durably stored on the disk.

Each redo log record has an associated list of dirty pages. This
relationship is established via LSN. A redo log record must be
flushed to disk, before its associated dirty pages. A redo log record
can be discarded only after all the associated dirty pages are flushed
to the disk.

A redo log record can be used to re-create its associated list of
dirty pages. This happens during database recovery.

Conclusion

This article provided an overview of the redo log subsystem of
InnoDB storage engine. The major data structures used in the redo log
subsystem are the mini transaction (mtr), the in-memory redo log
buffer and the on-disk redo log files. The InnoDB storage engine
tracks many LSN values to ensure that the write-ahead logging (WAL)
happens correctly.

Since data loss is unacceptable for an RDBMS, the redo log
subsystem is critical to a successful RDBMS. And since redo
logs are generated synchronously with DML operations, it is important
to do this efficiently. The size of the redo logs must be kept as
small as possible.

Friday May 31, 2013

Introduction

Transaction
locks are an important feature of any transactional storage engine.
There are two types of transaction locks - table locks and row
locks. Table locks are used to prevent a table from being altered or
dropped by one transaction when another transaction is using the
table. They are also used to prohibit a transaction from accessing a
table while it is being altered. InnoDB supports multiple
granularity locking (MGL), so to access rows in a table,
intention locks must be taken on the table.

Row
locks are at a finer granularity than table-level locks: different
threads can work on different parts of the table without interfering
with each other. This is in contrast with MyISAM, where the entire
table has to be locked even when updating unrelated rows. Having row
locks means that multiple transactions can read and write into a
single table, which increases the concurrency level of the storage
engine. InnoDB, being an advanced transactional storage engine,
provides both table and row level transaction locks.

This
article will provide information about how transaction locks are
implemented in InnoDB storage engine. The lock subsystem of InnoDB
provides many services to the overall system, like:

Creating,
acquiring, releasing and destroying row locks.

Creating,
acquiring, releasing and destroying table locks.

Providing
multi-thread safe access to row and table locks.

Data
structures useful for locating a table lock or a row lock.

Maintaining
a list of user threads suspended while waiting for transaction
locks.

Notification
of suspended threads when a lock is released.

Deadlock
detection

The
lock subsystem helps to isolate one transaction from another.
This article will provide information about how
transaction locks are created, maintained and used in the InnoDB
storage engine. All references to locks mean transaction locks,
unless specified otherwise.

Internal Data Structures of InnoDB

Before
we proceed with scenarios and algorithms, I would like to present the
following data structures. We need to be familiar with these data
structures to understand how transaction locks work in InnoDB. The
data structures of interest are:

The
enum lock_mode – provides the list of modes in which the
transaction locks can be obtained.

The
lock struct lock_t. This represents either a table lock or a row
lock.

The
struct trx_t which represents one transaction within InnoDB.

The
struct trx_lock_t which associates one transaction with all its
transaction locks.

The
global object of type lock_sys_t holds the hash table of row locks.

The
table descriptor dict_table_t, which uniquely identifies a table in
InnoDB. Each table descriptor contains a list of locks on the
table.

The lock struct or lock object

The
structure lock_t represents either a table lock (lock_table_t) or a
group of row locks (lock_rec_t) for all the rows belonging to the
same page. For different lock modes, different lock structs will be
used. For example, if a row in a page is locked in LOCK_X mode, and
if another row in the same page is locked in LOCK_S mode, then these
two row locks will be held in different lock structs.

This structure is defined as follows:

struct lock_t {
    trx_t*          trx;
    ulint           type_mode;
    hash_node_t     hash;
    dict_index_t*   index;
    union {
        lock_table_t    tab_lock;   /*!< table lock */
        lock_rec_t      rec_lock;   /*!< record lock */
    } un_member;                    /*!< lock details */
};

/** A table lock */
struct lock_table_t {
    dict_table_t*   table;  /*!< database table in dictionary
                            cache */
    UT_LIST_NODE_T(lock_t)
                    locks;  /*!< list of locks on the same
                            table */
};

/** Record lock for a page */
struct lock_rec_t {
    ulint   space;      /*!< space id */
    ulint   page_no;    /*!< page number */
    ulint   n_bits;     /*!< number of bits in the lock
                        bitmap; NOTE: the lock bitmap is
                        placed immediately after the
                        lock struct */
};

The important point here is the lock bitmap. The lock bitmap is a space-efficient way to represent the row locks. This space-efficient representation avoids the need for lock escalation and lock data persistence. (Note: For prepared transactions, it would be useful to have lock data persistence, but InnoDB does not currently support it.)

The lock bitmap is placed immediately after the lock struct object. If a page can contain a maximum of 100 records, then the lock bitmap would be of size 100 (or more). Each bit in this bitmap will represent a row in the page. The heap_no of the row is used to index into the bitmap. If the 5th bit in the bitmap is enabled, then the row with heap_no 5 is locked.
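
A hedged sketch of the bitmap lookup (lock_rec_get_nth_bit() is the
real function name in the lock module; the code below just mirrors
the description above and is not the actual source):

typedef unsigned long ulint;
typedef unsigned char byte;

struct lock_t { ulint type_mode; /* ... other header fields ... */ };

// The bitmap lives immediately after the lock_t struct; bit n
// corresponds to the row whose heap_no is n.
bool lock_rec_get_nth_bit(const lock_t* lock, ulint heap_no) {
    const byte* bitmap = (const byte*) lock + sizeof(lock_t);
    return bitmap[heap_no / 8] & (1 << (heap_no % 8));
}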

The Transaction And Lock Relationship

The struct trx_t is used to represent a transaction within InnoDB. The struct trx_lock_t is used to represent all locks associated with a given transaction. Here I have listed only those members relevant to this article.

struct trx_t {
    trx_id_t        id;     /*!< transaction id */
    trx_lock_t      lock;   /*!< Information about the transaction
                            locks and state. Protected by
                            trx->mutex or lock_sys->mutex
                            or both */
};

struct trx_lock_t {
    ib_vector_t*    table_locks;
                            /*!< All table locks requested by this
                            transaction, including AUTOINC locks */
    UT_LIST_BASE_NODE_T(lock_t)
                    trx_locks;
                            /*!< locks requested by the transaction;
                            insertions are protected by trx->mutex
                            and lock_sys->mutex; removals are
                            protected by lock_sys->mutex */
};

Global hash table of row locks

Before we look at how the row locks work internally, we need to be aware of this global data structure. The lock subsystem of the InnoDB storage engine has a global object lock_sys of type lock_sys_t. The class lock_sys_t is defined as follows:
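
(The original post showed the definition here; below is a partial,
hedged reconstruction of the 5.6-era struct, keeping only the members
discussed in this article.)

struct lock_sys_t {
    ib_mutex_t      mutex;      /*!< Mutex protecting the locks */
    hash_table_t*   rec_hash;   /*!< hash table of the record locks */
    // more ...
};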

The lock_sys_t::rec_hash member is the hash table of the record locks. Every page within InnoDB is uniquely identified by the (space_id, page_no) combination, called the page address. The hashing is done based on the page address. So given the page address, we can locate the list of lock_t objects of that page. All lock structs of the given page will be in the same hash bucket. So using this mechanism we can locate the row lock of any row.

The Transaction Subsystem Global Object

The transaction subsystem of InnoDB has one global object trx_sys of type trx_sys_t, which is an important internal data structure. It is defined as follows:

/** The transaction system central memory data structure. */
struct trx_sys_t {
    // more ...
    trx_list_t      rw_trx_list;
                            /*!< List of active and committed in
                            memory read-write transactions, sorted
                            on trx id, biggest first. Recovered
                            transactions are always on this list. */
    trx_list_t      ro_trx_list;
                            /*!< List of active and committed in
                            memory read-only transactions, sorted
                            on trx id, biggest first. NOTE:
                            The order for read-only transactions
                            is not necessary. We should exploit
                            this and increase concurrency during
                            add/remove. */
    // more ...
};

This global data structure contains the active transaction table (ATT) of InnoDB storage engine. In the following sections, I'll use this global object to access my transaction object through the debugger to demonstrate the transaction locks.

The table descriptor (dict_table_t)

Each table in InnoDB is uniquely identified by its name in the form of databasename/tablename. For each table, the data dictionary of InnoDB will contain exactly one table descriptor object of type dict_table_t. Given the table name, the table descriptor can be obtained. This table descriptor contains a list of locks on the table. This list can be used to check if the table has already been locked by a transaction.

struct dict_table_t {
    // more ...
    table_id_t      id;     /*!< id of the table */
    char*           name;   /*!< table name */
    UT_LIST_BASE_NODE_T(lock_t)
                    locks;  /*!< list of locks on the table; protected
                            by lock_sys->mutex */
};

The heap_no of a row

Every row in InnoDB is uniquely identified by the space_id, page_no and heap_no of the row. I assume that you know about space_id and page_no, so I'll explain here only the heap_no of the row. Each row in the page has a heap_no. The heap_no of the infimum record is 0, the heap_no of the supremum record is 1, and the heap_no of the first user record in the page is 2.

If we have inserted 10 records in the page, and there have been no updates on the page, then the heap_no values of the user records would be 2, 3, 4, 5 … 10, 11. The heap_no values follow the same order in which the records are accessed in ascending order.

The heap_no of the row is used to locate the bit in the lock bitmap corresponding to the row. If the row has a heap_no of 10, then the 10th bit in the lock bitmap corresponds to the row. This means that if the heap_no of the row is changed, then the associated lock structs must be adjusted accordingly.

The Schema

To demonstrate the transaction locks, I use the following schema in this article. There is one table t1 which has 3 rows (1, 'அ'), (2, 'ஆ') and (3, 'இ'). In this article, we will see when and how the transaction locks (table and row) are created and used. We will also cover the internal steps involved in creating table and row locks.
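
The original post showed the schema here; a plausible reconstruction
(the column names and types are assumptions):

create table t1 (f1 int primary key, f2 varchar(10) character set utf8) engine=innodb;
insert into t1 values (1, 'அ'), (2, 'ஆ'), (3, 'இ');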

Table Level Transaction Locks

The purpose of table level transaction locks or simply table locks is to ensure that no transactions modify the structure of the table when another transaction is accessing or modifying the table or the rows in a table. There are two types of table locks in MySQL – one is the meta-data locking (MDL) provided by the SQL engine and the other is the table level locks within InnoDB storage engine. Here the discussion is only about the table level locks within InnoDB.

On tables, InnoDB normally acquires only intention shared (LOCK_IS) or intention exclusive (LOCK_IX) modes. It does not lock tables in shared mode (LOCK_S) or exclusive mode (LOCK_X) unless explicitly requested via the LOCK TABLES command. One exception to this is the prepare phase of the online ALTER TABLE command, where the table is locked in shared mode. Please refer to Multiple Granularity Locking to learn about intention shared and intention exclusive locks.

Scenario (௧) for LOCK_IS table lock

Here is an example scenario that will take intention shared (LOCK_IS) lock on the table.
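
The original post listed the statements here; a plausible
reconstruction consistent with the description below:

mysql> set autocommit = 0;
mysql> begin;
mysql> select * from t1 lock in share mode;
mysql> insert into t1 values (4, 'ஈ');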

When the above 4 statements are executed, there would be two locks on the table - LOCK_IS and LOCK_IX. The select statement takes the table lock in LOCK_IS mode and the INSERT statement takes the table lock in LOCK_IX mode.

This can also be verified using the debugger. I'll leave it as an exercise for the reader.

What Happens Internally When Acquiring Table Locks

Each table is uniquely identified within the InnoDB storage engine using the table descriptor object of type dict_table_t. In this section we will see the steps taken internally by InnoDB to obtain a table lock. The function to refer to is lock_table() in the source code. Necessary mutexes are taken during these steps to ensure that everything works correctly in a multi-thread environment. This aspect is not discussed here.

The request for a table lock comes with the following information – the table to lock (dict_table_t), the mode in which to lock (enum lock_mode), and the transaction which wants the lock (trx_t).

Check whether the transaction is already holding an equal or stronger lock on the given table. Each transaction maintains a list of table locks obtained by itself (trx_t::trx_lock_t::table_locks). Searching this list for the given table and mode is sufficient to answer this question. If the current transaction is already holding an equal or stronger lock on the table, then the lock request is reported as success. If not, then go to next step.

Check if any other transaction has an incompatible lock request in the lock queue of the table. Each table descriptor has a lock queue (dict_table_t::locks). Searching this queue is sufficient to answer this question. If some other transaction holds an incompatible lock on the table, then the lock request needs to wait. Waiting for a lock can lead to time out or deadlock error. If there is no contention from other transactions for this table, then proceed further.

Allocate a lock struct object (lock_t). Initialize with table, trx and lock mode information. Add this object to the queue in dict_table_t::locks object of the table as well as the trx_t::trx_lock_t::table_locks.

Complete.

You can re-visit the above scenario (௨), follow the above steps, and verify that the number of lock structs created is 2.

Row Level Transaction Locks

Each row in the InnoDB storage engine needs to be uniquely identified in order to be able to lock it. A row is uniquely identified by the following pieces of information:

The space identifier

The page number within the space.

The heap number of the record within the page.

The descriptor of the index to which the row belongs (dict_index_t). Strictly speaking, this is not necessary to uniquely identify the row, but associating the row with its index helps provide user-friendly diagnostics and also helps developers debug a problem.

There are two types of row locks in InnoDB – the implicit and the explicit. The explicit row locks are the locks that make use of the global row lock hash table and the lock_t structures. The implicit row locks are logically arrived at based on the transaction information in the clustered index or secondary index record. These are explained in the following sections.

Implicit Row Locks

Implicit row locks do not have an associated lock_t object allocated. They are purely calculated based on the ID of the requesting transaction and the transaction ID available in each record. First, let us see how implicit locks are "acquired" (here is a comment from lock0lock.cc):

“If a transaction has modified or inserted an index record, then it owns an implicit x-lock on the record. On a secondary index record, a transaction has an implicit x-lock also if it has modified the clustered index record, the max trx id of the page where the secondary index record resides is >= trx id of the transaction (or database recovery is running), and there are no explicit non-gap lock requests on the secondary index record. ”

As we can see, the implicit locks are a logical entity and whether a transaction has implicit locks or not is calculated using certain procedures. This procedure is explained here briefly.

If a transaction wants to acquire a row lock (implicit or explicit), then it needs to determine whether any other transaction has an implicit lock on the row before checking on the explicit lock. As explained above this procedure is different for clustered index and secondary index.

For the clustered index, get the transaction id from the given record. If it is a valid transaction id, then that is the transaction which is holding the implicit exclusive lock on the row. Refer to the function lock_clust_rec_some_has_impl() for details.

For the secondary index, it is more cumbersome to calculate. Here are the steps:

Let R1 be the secondary index row for which we want to check if another transaction is holding an implicit exclusive lock.

Obtain the maximum transaction id (T1) of the page that contains the secondary index row R1.

Let T2 be the minimum transaction id of the InnoDB system. If T1 is less than T2, then there can be no transaction holding implicit lock on the row. Otherwise, go to next step.

Obtain the corresponding clustered index record for the given secondary index record R1.

Get the transaction id (T3) from the clustered index record. By looking at the clustered index record, including its older versions, find out if T3 could have modified or inserted the secondary index row R1. If yes, then T3 has an implicit exclusive lock on row R1. Otherwise it does not have.

In the case of secondary indexes, we need to make use of the undo logs to determine if any transactions have an implicit exclusive row lock on record. Refer to the function lock_sec_rec_some_has_impl() for details.

Also note that the implicit row locks do not affect the gaps.

Explicit Row Locks

Explicit row locks are represented by the lock_t structures. This section provides some scenarios in which explicit row locks are obtained in different modes. It also briefly discusses the internal procedure to acquire an explicit row lock.

Scenario (௩) for LOCK_S row lock

For the REPEATABLE READ transaction isolation level and weaker levels, InnoDB does not take shared locks on rows. So in the following scenario, we use the SERIALIZABLE transaction isolation level to demonstrate shared row locks:

This scenario is the same as Scenario (௧); only the verification via the debugger is different. In the case of row locks, each lock object represents row locks for all rows of a page (in the same lock mode). A lock bitmap is used for this purpose, which exists at the end of the lock struct object. In the above scenario, we can verify the locks as follows:
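
The original post showed a gdb dump here. A hedged reconstruction of
the interesting part (the address is a placeholder; 0x1e has bits 1
through 4 set, covering the supremum record, heap_no 1, plus the
three user rows, heap_no 2 to 4):

(gdb) x/1bx (char*) lock + sizeof(lock_t)
0x...:  0x1e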

Note that 4 bits are enabled. This means that 4 row locks have been obtained. The row in the page and the bit in the lock bitmap are related via the heap_no of the row, as described previously.
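As a rough illustration of the bitmap arithmetic (the layout here is simplified; in InnoDB the bitmap is allocated at the end of the lock_t object rather than inlined):

#include <cstdint>

// Simplified row lock with an inline bitmap; one bit per heap_no.
struct rec_lock_t {
    std::uint32_t space_id;   // tablespace containing the locked page
    std::uint32_t page_no;    // page whose rows this lock covers
    std::uint32_t n_bits;     // number of usable bits in the bitmap
    std::uint8_t  bitmap[16]; // bit i set => row with heap_no i is locked
};

bool rec_is_locked(const rec_lock_t& lock, std::uint32_t heap_no) {
    return (lock.bitmap[heap_no / 8] >> (heap_no % 8)) & 1u;
}

void rec_set_locked(rec_lock_t& lock, std::uint32_t heap_no) {
    lock.bitmap[heap_no / 8] |= static_cast<std::uint8_t>(1u << (heap_no % 8));
}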

Scenario 4 for LOCK_X row lock

Changing the above scenario slightly as follows will obtain LOCK_X locks on the rows.

Verification of this through the debugger is left as an exercise for the reader.

What Happens Internally When Acquiring Explicit Row Locks

To lock a row, all information necessary to uniquely identify the row must be supplied – the trx, the lock mode, the tablespace id, the page_no and the heap_no. The entry point for row locking is the lock_rec_lock() function in the source code. The steps are as follows (a code sketch follows these steps):

Check if the given transaction has a granted explicit lock on the row with an equal or stronger lock mode. To do this search, the global hash table of row locks is used. If the transaction already has a strong enough lock on the row, then there is nothing more to do. Otherwise, go to the next step.

Check if other transactions have a conflicting lock on the row. For this search also, the global hash table of row locks is used. If yes, then the current transaction must wait. Waiting can lead to a timeout or a deadlock error. If there is no contention for this row from other transactions, then proceed further.

Check if there is already a lock struct object for the given row. If yes, reuse the same lock struct object. Otherwise allocate a lock struct object.

Initialize the trx, lock mode, page_no, space_id and the size of the lock bit map. Set the bit of the lock bitmap based on the heap_no of the row.

Insert this lock struct object in the global hash table of row locks.

Add this to the list of transaction locks in the transaction object (trx_t::trx_lock_t::trx_locks).
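The following is a hedged sketch of that flow in C++, loosely modeled on lock_rec_lock(); all helper names are hypothetical stubs, not InnoDB functions.

#include <cstdint>
#include <vector>

enum class lock_mode_t { LOCK_S, LOCK_X };
enum class lock_result_t { GRANTED, WAITING };

struct rec_lock_t;                                    // as in the bitmap sketch above
struct trx_t { std::vector<rec_lock_t*> trx_locks; };

// Hypothetical helpers over the global hash table of row locks.
bool own_stronger_lock(const trx_t&, lock_mode_t, std::uint32_t, std::uint32_t) { return false; }
bool other_trx_conflicts(const trx_t&, lock_mode_t, std::uint32_t, std::uint32_t) { return false; }
rec_lock_t* find_or_alloc_lock(trx_t&, lock_mode_t, std::uint32_t) { return nullptr; }
void set_heap_no_bit(rec_lock_t*, std::uint32_t) {}
void hash_insert(rec_lock_t*) {}

lock_result_t rec_lock(trx_t& trx, lock_mode_t mode,
                       std::uint32_t page_no, std::uint32_t heap_no) {
    if (own_stronger_lock(trx, mode, page_no, heap_no)) {
        return lock_result_t::GRANTED;     // step 1: nothing more to do
    }
    if (other_trx_conflicts(trx, mode, page_no, heap_no)) {
        return lock_result_t::WAITING;     // step 2: may end in timeout or deadlock
    }
    rec_lock_t* lock = find_or_alloc_lock(trx, mode, page_no);  // step 3
    set_heap_no_bit(lock, heap_no);        // step 4: mark the row's bit
    hash_insert(lock);                     // step 5: publish in the global hash
    trx.trx_locks.push_back(lock);         // step 6: per-transaction lock list
    return lock_result_t::GRANTED;
}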

Releasing transaction locks

All the transaction locks acquired by a transaction are released by InnoDB at transaction end (commit or rollback). In the case of the REPEATABLE READ or SERIALIZABLE isolation levels, it is not possible for the SQL layer to release locks before transaction end. In the case of the READ COMMITTED isolation level, the SQL layer can release locks before transaction end by making use of the ha_innobase::unlock_row() handler interface API call.

To obtain the list of all transaction locks of a transaction, the following member can be used – trx_t::trx_lock_t::trx_locks.

Conclusion

This article provided an overview of the transaction locks in InnoDB. We looked at the data structures used to represent transaction locks. We analysed some scenarios in which the various locks are created. We also looked at the internal steps that happen when a table lock or a row lock is created.

Friday Apr 19, 2013

Introduction

This article will explain how data is organized in the InnoDB storage engine. First we will look at the various files that are created by InnoDB, then we will look at the logical data organization – tablespaces, pages, segments and extents. We will explore each of them in some detail and discuss their relationships with each other. At the end of this article, the reader will have a high-level view of the data layout within the InnoDB storage engine.

The Files

MySQL stores all data within the data directory. The data directory can be specified using the command line option --datadir or in the configuration file as datadir. Refer to the Server Command Options for complete details.

By default, when InnoDB is
initialized, it creates 3 important files in the data directory –
ibdata1, ib_logfile0 and ib_logfile1. The ibdata1 is the data file
in which system and user data will be stored. The ib_logfile0 and
ib_logfile1 are the redo log files. The location and size of these
files are configurable. Refer to Configuring
InnoDB for more details.

The data file ibdata1 belongs to the system tablespace, which has tablespace id (space_id) 0. As of MySQL 5.6, only the system tablespace can contain more than one data file; all other tablespaces can contain exactly one data file. Likewise, only the system tablespace can contain more than one table, while all other tablespaces can contain only one table.

The data files and the redo
log files are represented in the memory by the C structure
fil_node_t.

Tablespaces

By default, InnoDB contains
only one tablespace called the system tablespace whose identifier is
0. More tablespaces can be created indirectly using the
innodb_file_per_table configuration parameter. In MySQL 5.6, this
configuration parameter is ON by default. When it is ON, each table
will be created in its own tablespace in a separate data file.

The relationship between the
tablespace and data files is explained in the InnoDB source code
comment (storage/innobase/fil/fil0fil.cc) which is quoted here for
reference:

“A
tablespace consists of a chain of files. The size of the files does
not have to be divisible by the database block size, because we may
just leave the last incomplete block unused. When a new file is
appended to the tablespace, the maximum size of the file is also
specified. At the moment, we think that it is best to extend the file
to its maximum size already at the creation of the file, because then
we can avoid dynamically extending the file when more space is needed
for the tablespace.”

The last statement about
avoiding dynamic extension is applicable only to the redo log files
and not the data files. Data files are dynamically extended, but
redo log files are pre-allocated. Also, as already mentioned
earlier, only the system tablespace can have more than one data file.

It is also clearly stated that even though a tablespace can have multiple files, they are treated as one single large, concatenated file. So the order of files within the tablespace is important.

Pages

A data file is logically
divided into equal sized pages. The first page of the first data file
is identified with page number of 0, and the next page would be 1 and
so on. A page within a tablespace is uniquely identified by the page
identifier or page number (page_no). And each tablespace is uniquely
identified by the tablespace identifier (space_id). So a page is
uniquely identified throughout InnoDB by using the (space_id,
page_no) combination. And any location within InnoDB can be uniquely
identified by the (space_id, page_no, page_offset) combination, where
page_offset is the number of bytes within the given page.

How the pages from different
data files relate to one another is explained in another source code
comment: “A block's position in the tablespace is specified with
a 32-bit unsigned integer. The files in the chain are thought to be
catenated, and the block corresponding to an address n is the nth
block in the catenated file (where the first block is named the 0th
block, and the incomplete block fragments at the end of files are not
taken into account). A tablespace can be extended by appending a new
file at the end of the chain.” This means that the first page
of all the data files will not have page_no of 0 (zero). Only the
first page of the first data file in a tablespace will have the
page_no as 0 (zero).

Also in the above comment it
is mentioned that the page_no is a 32-bit unsigned integer. This is
the size of the page_no when stored on the disk.
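To illustrate the addressing, here is a small self-contained sketch that maps a (page_no, page_offset) pair onto a file in the chain, given a hypothetical list of file sizes in pages:

#include <cstdint>
#include <vector>

constexpr std::uint64_t kPageSize = 16 * 1024;  // default UNIV_PAGE_SIZE of 16K

struct location_t {
    std::size_t   file_index;   // which file of the chain holds the page
    std::uint64_t byte_offset;  // byte position within that file
};

// The files of a tablespace are logically catenated, so page_no n is the
// nth complete page counted across the chain.
location_t locate(const std::vector<std::uint64_t>& file_sizes_in_pages,
                  std::uint32_t page_no, std::uint32_t page_offset) {
    std::uint64_t remaining = page_no;
    for (std::size_t i = 0; i < file_sizes_in_pages.size(); ++i) {
        if (remaining < file_sizes_in_pages[i]) {
            return {i, remaining * kPageSize + page_offset};
        }
        remaining -= file_sizes_in_pages[i];
    }
    return {file_sizes_in_pages.size(), 0};  // page_no is beyond the tablespace
}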

Extents

An extent is 1MB of
consecutive pages. The size of one extent is defined as follows
(1048576 bytes = 1MB):

#define FSP_EXTENT_SIZE (1048576U / UNIV_PAGE_SIZE)

The macro UNIV_PAGE_SIZE used
to be a compile time constant. From mysql-5.6 onwards it is a global
variable. The number of pages in an extent depends on the page size
used. If the page size is 16K (the default), then an extent would
contain 64 pages.
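Working the formula through for a few page sizes (a tiny standalone example):

#include <cstdio>

// FSP_EXTENT_SIZE = 1048576 / UNIV_PAGE_SIZE, per the definition above.
constexpr unsigned extent_size_in_pages(unsigned univ_page_size) {
    return 1048576u / univ_page_size;
}

int main() {
    std::printf("16K pages: %u pages per extent\n", extent_size_in_pages(16384)); // 64
    std::printf(" 8K pages: %u pages per extent\n", extent_size_in_pages(8192));  // 128
    std::printf(" 4K pages: %u pages per extent\n", extent_size_in_pages(4096));  // 256
}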

Page Types

One page can be used for many
purposes. The page type will identify the purpose for which the page
is being used. The page type of each page will be stored in the page
header. The page types are available in the header file
storage/innobase/include/fil0fil.h. The following table provides a
brief description of the page types.

Page Type                   Description
FIL_PAGE_INDEX              The page is a B-tree node.
FIL_PAGE_UNDO_LOG           The page stores undo logs.
FIL_PAGE_INODE              The page contains an array of fseg_inode_t objects.
FIL_PAGE_IBUF_FREE_LIST     The page is in the free list of the insert buffer (change buffer).
FIL_PAGE_TYPE_ALLOCATED     Freshly allocated page.
FIL_PAGE_IBUF_BITMAP        Insert buffer (change buffer) bitmap.
FIL_PAGE_TYPE_SYS           System page.
FIL_PAGE_TYPE_TRX_SYS       Transaction system data.
FIL_PAGE_TYPE_FSP_HDR       File space header.
FIL_PAGE_TYPE_XDES          Extent descriptor page.
FIL_PAGE_TYPE_BLOB          Uncompressed BLOB page.
FIL_PAGE_TYPE_ZBLOB         First compressed BLOB page.
FIL_PAGE_TYPE_ZBLOB2        Subsequent compressed BLOB page.

Each page type is used for a different purpose. It is beyond the scope of this article to explore each page type. For now, it is sufficient to know that all pages have a page header (page_header_t) that stores the page type, and that the page type determines the contents and layout of the page.

Tablespace
Header

Each tablespace has a header of type fsp_header_t. This data structure is stored in the first page of the tablespace and contains the following information:

The table space
identifier (space_id)

Current size of the
table space in pages.

List of free extents

List of full extents not
belonging to any segment.

List of partially
full/free extents not belonging to any segment.

List of pages containing
segment headers, where all the segment inode slots are reserved.
(pages of type FIL_PAGE_INODE)

List of pages containing
segment headers, where not all the segment inode slots are reserved.
(pages of type FIL_PAGE_INODE).

From the tablespace header,
we can access the list of segments available in the tablespace. The
total space occupied by the tablespace header is given by the macro
FSP_HEADER_SIZE, which is equal to 16*7 = 112 bytes.
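Expressed as a hypothetical C-style struct covering only the fields listed above, the header could look as follows. This is a readable approximation, not the source's representation: the real header is accessed through byte offsets (FSP_FREE, FSP_FREE_FRAG, and so on), and each file list base node occupies 16 bytes on disk.

#include <cstdint>

// 16-byte base node of an InnoDB file-based list (length plus first/last
// pointers); shown here only as an opaque placeholder.
struct flst_base_node_t { std::uint8_t bytes[16]; };

// Hypothetical C-style view of the fsp_header_t fields listed above.
struct fsp_header_t {
    std::uint32_t    space_id;         // tablespace identifier
    std::uint32_t    size_in_pages;    // current size of the tablespace
    flst_base_node_t free;             // free extents
    flst_base_node_t full_frag;        // full extents not belonging to any segment
    flst_base_node_t free_frag;        // partially full/free extents, no segment
    flst_base_node_t seg_inodes_full;  // FIL_PAGE_INODE pages, all slots reserved
    flst_base_node_t seg_inodes_free;  // FIL_PAGE_INODE pages with free slots
};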

Reserved
Pages of Tablespace

As mentioned earlier, InnoDB always contains one tablespace, the system tablespace, with identifier 0. This is a special tablespace and is always kept open as long as the MySQL server is running. The first few pages of the system tablespace are reserved for internal use. This information can be obtained from the header file storage/innobase/include/fsp0types.h. They are listed below with a short description.

Page Number   Page Name                    Description
0             FSP_XDES_OFFSET              The extent descriptor page.
1             FSP_IBUF_BITMAP_OFFSET       The insert buffer bitmap page.
2             FSP_FIRST_INODE_PAGE_NO      The first inode page number.
3             FSP_IBUF_HEADER_PAGE_NO      Insert buffer header page, in system tablespace.
4             FSP_IBUF_TREE_ROOT_PAGE_NO   Insert buffer B-tree root page, in system tablespace.
5             FSP_TRX_SYS_PAGE_NO          Transaction system header, in system tablespace.
6             FSP_FIRST_RSEG_PAGE_NO       First rollback segment page, in system tablespace.
7             FSP_DICT_HDR_PAGE_NO         Data dictionary header page, in system tablespace.

As can be noted from the above, the first 3 pages are present in every tablespace, but the last 5 pages are reserved only in the system tablespace. In the case of other tablespaces, only 3 pages are reserved.

When the option
innodb_file_per_table is enabled, then for each table a separate
tablespace with one data file would be created. The source code
comment in the function dict_build_table_def_step() states the
following:

/* We create a new single-table tablespace for the table.
We initially let it be 4 pages:
- page 0 is the fsp header and an extent descriptor page,
- page 1 is an ibuf bitmap page,
- page 2 is the first inode page,
- page 3 will contain the root of the clustered index of the
table we create here. */

File
Segments

A tablespace can contain many file segments. A file segment (or just segment) is a logical entity. Each segment has a segment header (fseg_header_t), which points to the inode (fseg_inode_t) describing the file segment. The file segment header contains the following information:

The space to which the
inode belongs

The page_no of the inode

The byte offset of the
inode

The length of the file
segment header (in bytes).

Note: It would have been more readable (at the source code level) if fseg_header_t and fseg_inode_t had proper C-style structures defined for them, for example:
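A hypothetical definition matching the fields listed above; in the source these are byte offsets into the page, not a struct.

#include <cstdint>

// Hypothetical C-style view of fseg_header_t.
struct fseg_header_t {
    std::uint32_t inode_space_id;  // space to which the inode belongs
    std::uint32_t inode_page_no;   // page_no of the page holding the inode
    std::uint16_t inode_offset;    // byte offset of the inode within that page
};
// 4 + 4 + 2 = 10 bytes: the on-disk length of a file segment header.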

The fseg_inode_t object
contains the following information:

The segment id to which
it belongs.

List of full extents.

List of free extents of
this segment.

List of partially
full/free extents

Array of individual
pages belonging to this segment. The size of this array is half an
extent.

When a segment wants to grow,
it will get free extents or pages from the tablespace to which it
belongs.

Table

In InnoDB, when a table is
created, a clustered index (B-tree) is created internally. This
B-tree contains two file segments, one for the non-leaf pages and the
other for the leaf pages. From the source code documentation:

“In
the root node of a B-tree there are two file segment headers. The
leaf pages of a tree are allocated from one file segment, to make
them consecutive on disk if possible. From the other file segment we
allocate pages for the non-leaf levels of the tree.”

For a given table, the root
page of a B-tree will be obtained from the data dictionary. So in
InnoDB, each table exists within a tablespace, and contains one
B-tree (the clustered index), which contains 2 file segments. Each
file segment can contain many extents, and each extent contains 1MB
of consecutive pages.

Conclusion

This article discussed the data organization within InnoDB. We first looked at the files created by InnoDB, and then discussed the various logical entities – tablespaces, pages, page types, extents, segments and tables. We also looked at the relationships between them.

Tuesday Jan 15, 2013

This article discusses the approach taken by the InnoDB storage engine of MySQL to provide the REPEATABLE READ isolation level. First, an example is presented to demonstrate the two different designs that are possible. Then the design used in InnoDB is presented, followed by a short discussion of the advantages and disadvantages of this design choice. As part of this discussion, we also present a performance optimization done in MySQL 5.6.

An Example Scenario

I used MySQL 5.5 for this purpose. Let us create the following tables t1 and t2 in the test database that is available by default. Even though the default storage engine in MySQL 5.5 is InnoDB, I explicitly specify it for clarity.

Consider the following scenario (transactions are started with REPEATABLE READ isolation level):

S. No.   Session 1                  Session 2
1        start transaction;
2        select f1 from t1;
3                                   start transaction;
4                                   update t2 set f1 = f1+1;
5                                   commit;
6        select f1 from t2;

Please go through the above scenario and work out what the result set of the query “select f1 from t2” in Session 1, line 6, would be. Proceed once you know what you would expect.

By default, the isolation level in InnoDB is repeatable read. So in line 1 and 3 the transactions would be started with repeatable read isolation level. The output of the query in line 6 above will depend on the design approach taken to implement this isolation level. The two different design approaches possible are the traditional locking approach and the snapshot isolation technique.

In the traditional locking approach, a transaction running in the REPEATABLE READ isolation level would acquire a row lock on all the rows that have been read. No gap locks or range locks would be taken. These transactional row locks would be used to provide the semantics of the REPEATABLE READ isolation level. There are two drawbacks to this approach. One, the number of row locks taken can become very high, which is a performance problem. Two, this approach does not prevent phantom reads.

In InnoDB, the REPEATABLE READ isolation level is provided by making use of the snapshot isolation technique, which is explained in the following sections.

If the traditional locking approach is taken then the output would be 2 (the new value). But in InnoDB the output of query “select f1 from t2” in line 6 would be 1 (the old value).

Creating a Consistent Read View

InnoDB creates a consistent read view or a consistent snapshot either when the statement

mysql> START TRANSACTION WITH CONSISTENT SNAPSHOT;

is executed or when the first select query is executed in the transaction. In the example given above, the consistent read view is created in Session 1 when the first select query in line 2 is executed. This consistent read view is at the database level, so further select queries in Session 1 will not see the modifications done by other transactions in any of the tables in the database. This is the reason why the “select f1 from t2” in Session 1 sees the old tuple instead of the new tuple as updated by Session 2. But Session 1 can see the modifications done by itself.

When a transaction creates a read view, it creates a read view (read_view_t) object and copies the list of all the active transactions into it. It uses this information later to provide a consistent view of the database as it existed when the read view was created. How this is done is detailed in the sections below. But before we proceed further, there is a small detail that we need to be aware of to understand how the read view works.

Hidden Fields of an InnoDB Table

For all InnoDB tables, 3 fields are internally added to the table – DB_ROW_ID, DB_TRX_ID and DB_ROLL_PTR (refer to InnoDB Multi-Versioning). They are explained below:

S. No.   System Column   Description                                                Length (bytes)
1        DB_ROW_ID       The row identifier                                         6
2        DB_TRX_ID       The identifier of the transaction that created the tuple   6
3        DB_ROLL_PTR     The rollback data pointer, pointing to previous versions   7
                         of the tuple in the undo logs

Update: For tables with an explicit PRIMARY KEY or a UNIQUE NOT NULL key, the DB_ROW_ID will not be stored in the row, even though it will be listed as one of the columns in the data dictionary object for the table (of type dict_table_t).

For our current discussion the focus is on DB_TRX_ID and the DB_ROLL_PTR. Each tuple or row contains the identifier of the transaction that created the tuple, and a list of previous versions of the tuple.

In InnoDB, the transaction identifier is an integer value and it is a monotonically increasing value. So using the transaction identifier one can determine which transaction started earlier and which started later.

How Consistent Read View Works

With the help of the read view object and the hidden fields in the tuple, a transaction can reconstruct the database as it existed at the time the consistent read view was created. For this purpose, the transaction using the read view has to determine whether it can "see" a particular transaction. This is done using the following two rules:

Rule 1: When the read view object is created it notes down the smallest transaction identifier that is not yet used as a transaction identifier (trx_sys_t::max_trx_id). The read view calls it the low limit. So the transaction using the read view must not see any transaction with identifier greater than or equal to this low limit.

Rule 2: The transaction using the read view must not see a transaction that was active when the read view was created.

Whenever a tuple is accessed, the transaction using the read view will check if it can see the transaction that created the tuple (using the two rules mentioned above). If not, then check if the tuple has a previous version. If there are no previous versions then ignore the tuple and proceed to the next tuple. If there are previous versions, then repeat the procedure to find and build the correct version of the tuple to access from the undo log by following the DB_ROLL_PTR.
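A minimal C++ sketch of these two rules and of the version walk, with illustrative names (the real structures are read_view_t and the undo-log machinery):

#include <algorithm>
#include <cstdint>
#include <vector>

using trx_id_t = std::uint64_t;

// Snapshot captured when the read view is created.
struct read_view_t {
    trx_id_t              low_limit;   // smallest id not yet assigned (Rule 1)
    std::vector<trx_id_t> active_ids;  // ids active at creation time (Rule 2)

    bool sees(trx_id_t creator_id) const {
        if (creator_id >= low_limit) return false;                 // Rule 1
        return std::find(active_ids.begin(), active_ids.end(),
                         creator_id) == active_ids.end();          // Rule 2
    }
};

// One version of a tuple; older versions are rebuilt from the undo logs.
struct tuple_version_t {
    trx_id_t               db_trx_id;    // transaction that wrote this version
    const tuple_version_t* db_roll_ptr;  // previous version, or nullptr
};

// Walk the version chain until a visible version is found; nullptr means
// the tuple must be ignored by this transaction.
const tuple_version_t* visible_version(const read_view_t& view,
                                       const tuple_version_t* tuple) {
    while (tuple && !view.sees(tuple->db_trx_id)) {
        tuple = tuple->db_roll_ptr;
    }
    return tuple;
}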

This avoids the phantom read problem in the REPEATABLE READ isolation level in InnoDB.

Purpose of Read View

Why create the read view in the first place? To provide the REPEATABLE READ isolation level, creating a read view and later accessing the various versions of the rows without locking has a performance benefit. The alternative would be to take a read lock on all the rows that were read in the transaction. Accessing a large table with millions of records would generate that many shared row locks. The read view is created for this performance gain.

Performance Problem in MySQL 5.5

A transaction using the read view must decide whether to see the rows created by a particular transaction. This is done with the help of the two rules given above in the section “How Consistent Read View Works.” The first rule is simple and doesn't involve much copying. But the second rule is expensive: when the read view object is created, the complete list of active transaction identifiers is copied. The number of bytes copied depends on the number of active transactions in the database system when the read view is created.

The performance problem is not only because of the number of bytes copied during the read view creation, but also because of the actual traversal of the list of all the active transactions. This traversal is done under the protection of a mutex. So when there are many transactions trying to create a read view concurrently, the contention for this mutex becomes high.

The Read Only Optimization in MySQL 5.6

A recent discussion on the InnoDB developers mailing list suggested that read view creation can be a very expensive operation if the number of active concurrent transactions is very high (say, in the thousands). One of the optimizations discussed involved identifying the read-only transactions and ignoring them during read view creation.

Since read-only transactions do not make any changes to tuples, they are not required to be part of the read view. If read-only transactions are ignored, then the number of transaction identifiers that need to be copied during read view creation can be reduced significantly. This also reduces the mutex contention discussed earlier. The idea is based on the reasonable assumption that at any point in time there will be a mix of read-only and read-write active transactions in a database system, and that the read-only transactions will dominate the workload.

To facilitate this, the transaction subsystem of InnoDB in MySQL 5.6 contains two lists of transactions: one for read-write transactions (trx_sys_t::rw_trx_list) and the other for read-only transactions (trx_sys_t::ro_trx_list). Only the list of read-write transactions is used for creating the read view. The transaction list traversed to create a read view is now shorter, and the number of bytes copied is smaller. This makes read view creation faster and more scalable.

Conclusion

In this article, we saw how InnoDB provides the repeatable read isolation level using a consistent read view or consistent snapshot. Because of this approach, InnoDB is able to avoid the creation of many shared row locks and hence is able to achieve better performance. The phantom read problem is also avoided. We also saw how the consistent read view works.

We discussed a particular performance problem in MySQL 5.5 and presented an optimization done in MySQL 5.6, which makes read view creation faster and more scalable. In certain highly concurrent situations, the current design of the read view still has some performance challenges, and the MySQL team at Oracle is actively working on solutions.

MySQL before the InnoDB Plugin

Traditionally, the MySQL storage engine interface has taken a
minimalistic approach to data definition language. The only natively
supported operations were CREATE TABLE, DROP
TABLE and RENAME TABLE. Any other operation, such as ALTER
TABLE, was performed by creating a new table with the new definition
and copying all rows from the original table into it.

You could imagine that the database could crash when copying all
rows from the original table to the new one. For example, it could run
out of file space. Then, on restart, InnoDB would roll back the
huge INSERT transaction. To fix things a little, a hack
was added to ha_innobase::write_row for committing the
transaction every 10,000 rows.

Still, it was frustrating that even a simple DROP
INDEX would make the table unavailable for modifications for a
long time.

Fast Index Creation in the InnoDB Plugin of MySQL 5.1

MySQL 5.1 introduced a new interface for CREATE INDEX
and DROP INDEX. The old table-copying approach can still
be forced by SET old_alter_table=1.

This interface is used in MySQL 5.5 and in the InnoDB Plugin for
MySQL 5.1. Apart from the ability to do a quick DROP
INDEX, the main advantage is that InnoDB will execute a
merge-sort algorithm before inserting the index records into each
index that is being created. This should speed up the insert into the
secondary index B-trees and potentially result in a better B-tree fill
factor.

The 5.1 ALTER TABLE interface was not perfect. For
example, DROP FOREIGN KEY still invoked the table
copy. Renaming columns could conflict with InnoDB foreign key
constraints. Combining ADD KEY and DROP KEY
in ALTER TABLE was problematic and not atomic inside the
storage engine.

The ALTER TABLE interface in MySQL 5.6

The ALTER TABLE storage engine interface was
completely rewritten in MySQL 5.6. Instead of introducing a method
call for every conceivable operation, MySQL 5.6 introduced a handful
of methods, and data structures that keep track of the requested
changes.

In MySQL 5.6, online ALTER TABLE operation can be
requested by
specifying LOCK=NONE. Also LOCK=SHARED
and LOCK=EXCLUSIVE are available. The old-style table
copying can be requested by ALGORITHM=COPY. That one will
require at least LOCK=SHARED. From the InnoDB point of
view, anything that is possible with LOCK=EXCLUSIVE is
also possible with LOCK=SHARED.

Most ALGORITHM=INPLACE operations inside InnoDB can be
executed online (LOCK=NONE). InnoDB will always require
an exclusive table lock in two phases of the operation. The execution
phases are tied to a number of methods:

handler::check_if_supported_inplace_alter

Checks if the storage engine can perform all requested operations,
and if so, what kind of locking is needed.

handler::prepare_inplace_alter_table

InnoDB uses this method to set up the data dictionary cache for
the upcoming CREATE INDEX operation. We need stubs
for the new indexes so that we can keep track of changes to the table
during online index creation. Also, crash recovery would drop any
indexes that were incomplete at the time of the crash.

handler::inplace_alter_table

In InnoDB, this method is used for creating secondary indexes or
for rebuilding the table. This is the ‘main’ phase that can be executed
online (with concurrent writes to the table).

handler::commit_inplace_alter_table

This is where the operation is committed or rolled back. Here,
InnoDB would drop any indexes, rename any columns, drop or add
foreign keys, and finalize a table rebuild or index creation. It
would also discard any logs that were set up for online index
creation or table rebuild.

The prepare and commit phases require an
exclusive lock, blocking all access to the table. If MySQL times out
while upgrading the table meta-data lock for the commit
phase, it will roll back the ALTER TABLE operation.

In MySQL 5.6, data definition language operations are still not
fully atomic, because the data dictionary is split. Part of it is
inside InnoDB data dictionary tables. Part of the information is only
available in the *.frm file, which is not covered by any
crash recovery log. But, there is a single commit phase inside the
storage engine.

Online Secondary Index Creation

It may occur that an index needs to be created on a new column to
speed up queries. But, it may be unacceptable to block modifications
on the table while creating the index.

It turns out that it is conceptually not so hard to support online
index creation. All we need are some more execution phases, sketched
in code after the list:

Set up a stub for the index, for logging changes.

Scan the table for index records.

Sort the index records.

Bulk load the index records.

Apply the logged changes.

Replace the stub with the actual index.
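A purely illustrative driver for these phases; the helper names are hypothetical stubs, not InnoDB functions.

// Purely illustrative; each stub stands for one phase described above.
struct index_stub_t {};

index_stub_t* create_stub_with_log()       { return nullptr; }  // phase 1
void scan_table_for_entries(index_stub_t*) {}                   // phase 2
void merge_sort_entries(index_stub_t*)     {}                   // phase 3
void bulk_load_btree(index_stub_t*)        {}                   // phase 4
void apply_online_log(index_stub_t*)       {}                   // phase 5
void publish_index(index_stub_t*)          {}                   // phase 6

void create_index_online() {
    index_stub_t* stub = create_stub_with_log();  // concurrent writers log here
    scan_table_for_entries(stub);
    merge_sort_entries(stub);
    bulk_load_btree(stub);       // bypasses record locking
    apply_online_log(stub);      // may flag duplicates or log overflow
    publish_index(stub);         // the stub becomes the real index
}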

Threads that modify the table will log the operations to the logs
of each index that is being created. Errors, such as log overflow or
uniqueness violations, will only be flagged by the ALTER
TABLE thread. The log is conceptually similar to
the InnoDB change
buffer.

The bulk load of index records will bypass record locking. We still
generate redo log for writing the index pages. It would suffice to log
page allocations only, and to flush the index pages from the buffer
pool to the file system upon completion.

The keyword ALGORITHM=INPLACE is somewhat misleading,
because certain operations cannot be performed in-place. For example,
changing the ROW_FORMAT of a table requires a
rebuild.

Online operation (LOCK=NONE) is not allowed in the
following cases:

when adding an AUTO_INCREMENT column,

when the table contains FULLTEXT indexes
or a hidden FTS_DOC_ID column, or

when there are FOREIGN KEY constraints referring to
the table, with ON…CASCADE or ON…SET NULL
option.

The FOREIGN KEY limitations are needed, because MySQL
does not acquire meta-data locks on the child or parent tables when
executing SQL statements.

Theoretically, InnoDB could support operations like ADD
COLUMN and DROP COLUMN in-place, by lazily
converting the table to a newer format. This would require that the
data dictionary keep multiple versions of the table definition. For
simplicity, we will copy the entire table, even for DROP
COLUMN.

The bulk copying of the table will bypass record locking and undo
logging. To facilitate online operation, a temporary log will be
associated with the clustered index of the table. Threads that modify
the table will also write their changes to this log.

When altering the table, we skip all records that have been marked
for deletion. In this way, we can simply discard any undo log records
that were not yet purged from the original table.

Off-page columns, or BLOBs, are an important consideration. We
suspend the purge of delete-marked records if it would free any
off-page columns from the old table. This is because the BLOBs can be
needed when applying changes from the log. We have special logging for
handling the ROLLBACK of an INSERT that
inserted new off-page columns. This is because the columns will be
freed at rollback.

In MySQL 5.6, we continued our development on InnoDB Memcached and completed a few widely desired features that make InnoDB Memcached competitive in more scenarios. Notably, they are:

1) Support for multiple table mappings

2) A background thread to auto-commit long-running transactions

3) Enhancements in binlog performance

Let's go over each of these features one by one. In the last section, we will go over a couple of internally performed performance tests.

Support multiple table mapping

In our earlier release, all InnoDB Memcached operations were mapped to a single InnoDB table. In real life, users might want to use the InnoDB Memcached feature on different tables. Thus, being able to access different tables at run time, and having different mappings for different connections, becomes a very desirable feature. In this GA release, users are able to do both. We will discuss the key concepts and the key steps in using this feature.

1) "mapping name" in
the "get" and "set" command

In order to allow InnoDB Memcached to map to a new table, the user (DBA) is still required to "pre-register" the table(s) in the InnoDB Memcached "containers" table (there is a security consideration behind this requirement). If you would like to know more about the "containers" table, please refer to my earlier blogs on blogs.innodb.com. Once a table is registered, InnoDB Memcached will be able to look it up when it is referred to.

Each such registered table will have a unique "registration name" (or mapping name) corresponding to the "name" field in the "containers" table. To access these tables, the user includes the "registration name" in their get or set commands, in the form "get @@new_mapping_name.key"; the prefix "@@" is required to signal a mapped-table change. The key and the "mapping name" are separated by a configurable delimiter; by default, it is ".". So the syntax is:

get
[@@mapping_name.]key_name

set
[@@mapping_name.]key_name

or

get
@@mapping_name

set
@@mapping_name

Here is an example:

Let's set up three tables in the
"containers" table:

The first is a mapping to the InnoDB table "test/demo_test", with mapping name "setup_1":

INSERT INTO containers VALUES ("setup_1", "test", "demo_test",
    "c1", "c2", "c3", "c4", "c5", "PRIMARY");

Similarly, we set up table mappings for table "test/new_demo" with name "setup_2", and for table "my_database/my_demo" with name "setup_3":

To switch to table "my_database/my_demo" and get the value corresponding to "key_a", the user issues:

get
@@setup_3.key_a

(this will also output the value corresponding to the key "key_a")

or simply

get
@@setup_3

Once this is done, the connection stays switched to the "my_database/my_demo" table until another table mapping switch is requested, so it can continue issuing regular commands like:

get
key_b

set
key_c 0 0 7

These
DMLs will all be directed to "my_database/my_demo" table.

This also implies that different connections can have different bindings (to different tables).

2) Delimiter:

For the delimiter "." that separates the "mapping name" from the key value, we also added a configuration option in the "config_options" system table, with the name "table_map_delimiter":

INSERT INTO config_options
VALUES("table_map_delimiter", ".");

So if a user wants a different delimiter, they can change it in the config_options table.

3) Default mapping:

Once we have multiple table mappings, there should always be a "default" mapping. For this, we decided that if a mapping with the name "default" exists, it will be chosen as the default mapping; otherwise, the first row of the containers table will be chosen as the default setting.

Please note, user tables can be repeated in the "containers" table (for example, if a user wants to access different columns of the table in different settings), as long as they use different mapping/configuration names in the first column, which is enforced by a unique index.

4) bind command

In addition, we also extended the protocol and added a bind command; its usage is fairly straightforward. To switch to the "setup_3" mapping above, you simply issue:

bind
setup_3

This will switch this connection's InnoDB table to "my_database/my_demo".

In summary, with this feature you can now access different tables from different sessions; and even within a single connection, you can query different tables.

Background thread to auto-commit long running transactions

This is a feature related to the "batch" concept we discussed in earlier blogs. The "batch" feature allows us to batch read and write operations, and commit them only after a certain number of calls. The batch size is controlled by the configuration parameters "daemon_memcached_w_batch_size" and "daemon_memcached_r_batch_size". This can significantly boost performance.

However, it also comes with some disadvantages. For example, you will not be able to view "uncommitted" operations from the SQL end unless you set the transaction isolation level to READ UNCOMMITTED. In addition, certain row locks will be held for an extended period of time, which might reduce concurrency.

To deal with this, we introduced a background thread that "auto-commits" transactions if they have been idle for a certain amount of time (the default is 5 seconds). The background thread wakes up every second, loops through every "connection" opened by Memcached, and checks for idle transactions. If a transaction has been idle longer than a certain limit and is not being used, the thread commits it. This limit is configurable by changing "innodb_api_bk_commit_interval"; its default value is 5 seconds, the minimum is 1 second, and the maximum is 1073741824 seconds.
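A simplified sketch of what such a background thread does (illustrative C++ only, not the plugin's actual code):

#include <chrono>
#include <thread>
#include <vector>

using std::chrono::seconds;
using std::chrono::steady_clock;

struct memcached_conn_t {
    bool in_use = false;        // currently serving a request
    bool has_open_trx = false;  // an uncommitted batch exists
    steady_clock::time_point last_activity;
    void commit() { has_open_trx = false; }  // commit the pending batch
};

// Wakes up every second and commits any transaction that has been idle
// longer than the limit (innodb_api_bk_commit_interval, default 5s).
void bk_commit_thread(std::vector<memcached_conn_t>& conns, seconds idle_limit) {
    for (;;) {
        std::this_thread::sleep_for(seconds(1));
        const auto now = steady_clock::now();
        for (auto& conn : conns) {
            if (conn.has_open_trx && !conn.in_use &&
                now - conn.last_activity >= idle_limit) {
                conn.commit();
            }
        }
    }
}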

With the help of this background thread, you will not need to worry about long-running uncommitted transactions when setting daemon_memcached_w_batch_size and daemon_memcached_r_batch_size to a large number. It also reduces the number of locks that could be held due to long-running transactions, and thus further increases concurrency.

Enhancement in binlog performance

As you might all know, binlog operations are not done by the InnoDB storage engine; rather, they are handled in the MySQL layer. In order to support binlog operations through InnoDB Memcached, we have to artificially create some MySQL constructs so that we can access the binlog handler APIs. In the previous lab release, for simplicity, we created and destroyed these MySQL constructs (such as THD) for each operation. This required us to set the "batch" size to 1 whenever the binlog was on, no matter what "daemon_memcached_w_batch_size" and "daemon_memcached_r_batch_size" were configured to. This put a big restriction on our capability to scale, and there is also quite a bit of overhead in creating and destroying such constructs, which bogs performance down.

With this release, we made the necessary changes to keep these MySQL constructs for as long as they are valid for a particular connection, so there are no longer repeated and redundant open and close (table) calls. Now, even with the binlog option enabled (with innodb_api_enable_binlog), we can still batch transactions with daemon_memcached_w_batch_size and daemon_memcached_r_batch_size, and thus scale write/read performance.

Although there are still overheads that keep InnoDB Memcached from performing as fast as when the binlog is turned off, it is much better compared to the previous release. And we are continuing to optimize this area to improve performance as much as possible.

Performance
Study:

Amerandra of our System QA team has conducted some performance studies on queries through our InnoDB Memcached connection and through the plain SQL end, and they show some interesting results.

** The "set" operation, when implemented in InnoDB Memcached, involves a couple of DMLs: it first queries the table to see whether the "key" exists; if it does not, the new key/value pair is inserted; if it does exist, the "value" field of the matching row (identified by key) is updated. So when used in the above query, it behaves like a precompiled stored procedure, and the query just executes that procedure.

*** added "--daemon_memcached_option=-t8" (default is 4 threads)

So we can see that with this "set" query, InnoDB Memcached can run 4.5 to 9 times faster than the MySQL server.

Table 2: Performance comparison
on Get operations

Connections   5.6.7-RC Memcached plugin,      5.6.7-RC*      X faster
              memcached-threads=8, Get (QPS)  Get (QPS)
8             42,000                          27,000         1.56
32            101,000                         55,000         1.83
128           117,000                         52,000         2.25
512           109,000                         52,000         2.10

With the "get" query (or the select query), Memcached performs 1.5 to 2 times faster than normal SQL.

Summary:

In
summary, we added several much-desired features to InnoDB Memcached in this
release, allowing user to operate on different tables with this Memcached
interface. We also now provide a background commit thread to commit long running
idle transactions, thus allow user to configure large batch write/read without
worrying about large number of rows held or not being able to see (uncommit)
data. We also greatly enhanced the performance when Binlog is enabled. We will
continue making efforts in both performance enhancement and functionality areas
to make InnoDB Memcached a good demo case for our InnoDB APIs.

When InnoDB compresses a page it needs the result to fit into its predetermined compressed page size (specified with KEY_BLOCK_SIZE). When the result does not fit we call that a compression failure. In this case InnoDB needs to split up the page and try to compress again. Needless to say, compression failures are bad for performance and should be minimized.

Whether the result of the compression will fit largely depends on the data being compressed, and some tables and/or indexes may contain more compressible data than others. So it would be nice if the compression failure rate, along with other compression stats, could be monitored on a per-table or even a per-index basis, wouldn't it?

This is where a new INFORMATION_SCHEMA table in MySQL 5.6 kicks in. INFORMATION_SCHEMA.INNODB_CMP_PER_INDEX provides exactly this helpful information, broken down per database, table and index: the number of compression operations attempted, the number that succeeded, the time spent compressing, and the corresponding decompression statistics. Sorting by the ratio of failed to attempted compression operations would reveal the most problematic tables and indexes – those with the highest compression failure rate.

From there on, the way to improve performance would be to try to increase the compressed page size, or to change the structure of the table/indexes or the data being stored, and see if that has a positive impact on performance.

MySQL 5.6 comes with significant improvements for the compression support inside InnoDB. The enhancements that we'll talk about in this piece are also a good example of community contributions. The work on these was conceived, implemented and contributed by the engineers at Facebook. Before we plunge into the details let us familiarize ourselves with some of the key concepts surrounding InnoDB compression.

The InnoDB buffer pool will attempt to cache compressed pages like normal pages. However, whenever a page is actively used by a transaction, we will always have the uncompressed version of the page as well; i.e., a page can be in the buffer pool in compressed-only form, or in a state where we have both the compressed page and the uncompressed version, but never in uncompressed-only form. On disk we only ever have the compressed page.

When both compressed and uncompressed images are present in the buffer pool, they are always kept in sync, i.e., changes are applied to both atomically.

Recompression happens when changes are made to the compressed data. In order to minimize recompressions InnoDB maintains a modification log within a compressed page. This is the extra space available in the page after compression and it is used to log modifications to the compressed data thus avoiding recompressions.

DELETE (and ROLLBACK of DELETE) and purge can be
performed without recompressing the page. This is because the
delete-mark bit and the system fields DB_TRX_ID and DB_ROLL_PTR are
stored in uncompressed format on the compressed page. A record can be
purged by shuffling entries in the compressed page directory. This can
also be useful for updates of indexed columns, because UPDATE of a key
is mapped to INSERT+DELETE+purge.

A compression failure happens when we attempt to recompress a page and it does not fit in the fixed size. In such a case, we first try to reorganize the page and attempt to recompress, and if that fails as well, we split the page into two and recompress both pages.

Now let's talk about the three major improvements that we made in MySQL 5.6.

Logging of Compressed Page Images: InnoDB used to log the entire compressed data on the page to the redo logs when recompression happens. This was an extra safety measure to guard against the rare case where an attempt is made to do recovery using a different zlib version from the one that was used before the crash. Because recovery is a page-level operation in InnoDB, we have to be sure that all recompression attempts succeed without causing a B-tree page split. However, writing entire compressed data images to the redo log files not only makes the operation heavyweight but can also adversely affect flushing activity. This happens because redo space is used in a circular fashion: when we generate much more redo than normal, we fill up the space much more quickly, and in order to reuse the redo space we have to flush the corresponding dirty pages from the buffer pool. Starting with MySQL 5.6, a new global configuration parameter, innodb_log_compressed_pages, controls this behavior. The default value is true, which is the same as the previous behavior. If you are sure that you are not going to attempt to recover from a crash using a different version of zlib, then you should set this parameter to false. This is a dynamic parameter.

Compression Level: You can now set the compression level that zlib uses to compress the data. The global parameter is innodb_compression_level; the default value is 6 (the zlib default) and allowed values are 1 to 9. Again, the parameter is dynamic, i.e., you can change it on the fly.

Dynamic Padding to Reduce Compression Failures: Compression failures are expensive in terms of CPU. We go through the hoops of recompress, failure, reorganize, recompress, failure and finally page split. At the same time, how often we encounter compression failures depends largely on the compressibility of the data. In MySQL 5.6, courtesy of Facebook engineers, we have an adaptive algorithm based on per-index statistics that we gather about compression operations. The idea is that if a certain index/table is experiencing too many compression failures, then we should pack the 16K uncompressed version of the page less densely, i.e., we let some space in the 16K page go unused in the hope that recompression won't end in a failure. In other words, we dynamically keep adding 'pad' to the 16K page until compression failures fall within an agreeable range. It works the other way as well: we keep removing the pad if the failure rate is fairly low. To tune the padding effort, two configuration variables are exposed.

innodb_compression_failure_threshold_pct: default 5, range 0 – 100, dynamic; the percentage of compression operations that may fail before we start padding. The value 0 has the special meaning of disabling padding.

innodb_compression_pad_pct_max: default 50, range 0 – 75, dynamic; the maximum percentage of each page that may be reserved as padding.
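The adaptive loop can be sketched as follows. This is a simplification with illustrative constants and step sizes; InnoDB keeps real per-index compression statistics.

// Simplified per-index padding controller; step size and bookkeeping are
// illustrative, not InnoDB's exact algorithm.
struct zip_pad_stat_t {
    unsigned pad_pct  = 0;  // % of the 16K page deliberately left unused
    unsigned ops      = 0;  // compression attempts in the current round
    unsigned failures = 0;  // failed attempts in the current round
};

void update_padding(zip_pad_stat_t& s,
                    unsigned failure_threshold_pct,  // e.g. 5
                    unsigned pad_pct_max) {          // e.g. 50
    if (s.ops == 0) return;
    const unsigned fail_pct = 100 * s.failures / s.ops;
    if (fail_pct > failure_threshold_pct && s.pad_pct < pad_pct_max) {
        s.pad_pct += 5;  // pack less densely next round
    } else if (fail_pct < failure_threshold_pct && s.pad_pct >= 5) {
        s.pad_pct -= 5;  // reclaim pad while failures stay rare
    }
    s.ops = s.failures = 0;  // start a new measurement round
}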

Ever since its first release, we have continued consolidating and developing the InnoDB Full-Text Search feature, and there is one recent improvement worth blogging about. It is a joint effort with the MySQL Optimizer team that simplifies the query plans of some common queries and dramatically shortens query times. I will describe the issue, our solution, and the end result, with some performance numbers, to demonstrate our continuing efforts to enhance the Full-Text Search capability.

The Issue:

As we discussed in previous blogs, InnoDB implements the Full-Text index as inverted auxiliary tables. Once parsed, a query is reinterpreted as several queries against the related auxiliary tables, and the results are then merged and consolidated into the final result. So at the end of the query, we will have all matching records on hand, sorted by their ranking or by their Doc IDs.

Unfortunately, MySQL's optimizer and query processing had initially been designed for the MyISAM Full-Text index, and sometimes did not fully utilize the complete result set already computed by InnoDB.

Here are a couple of examples:

Case 1: Query result ordered by rank, with only the top N results:

mysql>
SELECT FTS_DOC_ID, MATCH (title, body) AGAINST ('database')

AS SCORE FROM articles ORDER BY score DESC
LIMIT 1;

In this query, the user tries to retrieve the single record with the highest ranking. It should be a quick answer once we have all the matching documents on hand, especially since they are already ranked. However, before this change, MySQL would retrieve rankings for almost every row in the table, sort them, and then come up with the top-ranked result. This whole retrieve-and-sort is quite unnecessary, given that InnoDB already has the answer.

In a real-life case, a user could have millions of rows. In the old scheme, MySQL would retrieve and sort millions of rows' rankings, even if our FTS search had already found only 3 matching rows. The million-ranking retrieval is done in vain: it should just ask for those 3 matched rows' rankings, since all other rows' rankings are 0. And if it wants the top ranking, it can simply take the first record from our already sorted result.

Case 2: SELECT COUNT(*) on matching records:

mysql>
SELECT COUNT(*) FROM articles WHERE MATCH (title,body)

AGAINST
('database' IN NATURAL LANGUAGE MODE);

In this case, the InnoDB search can find the matching rows quickly and will have all of them on hand. However, before our change, every row in the table was requested by MySQL one by one, just to check whether its ranking was larger than 0, and a count was computed afterwards.

In fact, there is no need for MySQL to fetch all rows; InnoDB already has all the matching records. The only thing needed is to call an InnoDB API to retrieve the count.

The difference can be huge. The following query output shows how big the difference can be:

mysql>
select count(*) from searchindex_inno where match(si_title, si_text) against
('people')

The output shows that it took InnoDB only 3 seconds to get the result, while the whole query took 16 minutes to finish. A large amount of time was wasted on the unneeded row fetching.

The
Solution:

The solution is obvious: MySQL can skip some of its steps, optimize its plan, and obtain useful information directly from InnoDB. Some of the savings from doing this include:

1) Avoid redundant sorting. Since InnoDB has already sorted the result according to ranking, the MySQL query processing layer does not need to sort again to get the top matching results.

2) Avoid row-by-row fetching to get the matching count. InnoDB provides all the matching records; all rows not in the result list have a ranking of 0 and need not be retrieved. InnoDB also has a count of the total matching records on hand, so there is no need to recount.

3) Covered index scan. InnoDB results always contain the matching records' Document IDs and their rankings. So if only the Document ID and ranking are needed, there is no need to go to the user table to fetch the record itself.

4) Narrow the search result early, reducing user table accesses. If the user wants the top N matching records, we do not need to fetch all matching records from the user table. We can first select the top N matching Doc IDs, and then fetch only the corresponding records.

Performance
Results and comparison with MyISAM

The effect of this change is very visible. I include six test results, performed by Alexander Rubin, to demonstrate how fast InnoDB queries now are compared with MyISAM Full-Text Search.

These tests are based on English Wikipedia data of 5.4 million rows, an approximately 16GB table. The tests were performed on a machine with one dual-core CPU, an SSD drive, 8GB of RAM, and the InnoDB buffer pool set to 8GB.

Again, tables 3 to 6 all show InnoDB consistently outperforming MyISAM in these queries by a large margin. It becomes obvious that InnoDB has a great advantage over MyISAM in handling large data searches.

Summary:

These results demonstrate the great performance we can achieve by making the MySQL optimizer and InnoDB Full-Text Search more tightly coupled. There are still many cases where InnoDB's result information is not fully taken advantage of, which means we still have great room to improve. We will continue to explore this area and obtain more dramatic results for InnoDB full-text searches.