“Use innodb page compression. For some workloads (particularly those with lots of char/varchar/text data types) compression will allow the data to be more compact, stretching out that performance curve for longer. It may also allow you to more easily justify SSDs which are typically smaller in capacity. InnoDB page compression was improved a lot in MySQL 5.6, courtesy of Facebook providing a series of patches.”

After writing that, I decided to setup an experiment.

The Experiment

I wanted to find data that was typical to be stored in a database, but would also compress well. There is a huge potential for skew here, since if I used dummy data such as ‘AAAAAAAAAAAA’ it will compress very well. Likewise, jpeg images stored in blobs would unlikely compress any more than they already are. So I arrived at using the English version of Wikipedia, which is hopefully representative of “real data”.

So after downloading the data set and importing it from its XML format into MySQL, I ran the following script:

With the results, there’s really two aspects to measure compression by here – footprint and import time:

So for 4K and 8K pages, there is a >50% saving in disk space, since the table compresses down from 51G to 21G. In addition, it’s pretty clear that (at least on my hardware) the compression doesn’t have any impact on the import time: the reduced footprint actually helped the 8K compressed pages take less time to insert than the uncompressed.

So I believe I did manage to prove that compression is very useful. However I also acknowledge that there is some skew in my test:

INSERT SELECT is single threaded. Ideally a test should reflect a real-life requirement.

My single 5400 RPM hard drive does not reflect common configuration in production.

I also disabled the doublewrite buffer and set innodb-flush-log-at-trx-commit=2. It would be nice to demonstrate if this impacts the test.

I really should have waited for all dirty pages to flush between all tests. This was an oversight on my part, and I think this advantaged 16K uncompressed insert (which is just a little better than the others).

Today I wanted to take a look at improving the performance of tables that cause performance problems based largely on their size. Some of this advice also applies to databases that are large in-aggregate over many tables, but I always find the individually large table a special-case that is problematic.

What you will normally find is that the speed that the table can be modified will trend down as the size increases. Here is what I am going to call the typical B+Tree index performance over time:

This graph taken from a post by MySQL@Facebook. It shows the performance degradation of inserting one billion rows into a table with insert buffer disabled (not recommended, and used for demonstration purposes only). Note that this is in log scale!

So we should expect degradation of performance due to the structure of the index, but there are actually some ways that we can try and stretch out the curve, and not degrade as quickly.

Ten potential ways to reduce large table impact:

Make sure to use InnoDB instead of MyISAM. MyISAM can be faster at inserts to the end of a table, but it has both table locking (limiting updates and deletes) and uses a single lock to protect the key buffer when loading data to/from disk, resulting in contention. It also does not have the change buffering feature described just below.

InnoDB has change buffering (previously called the insert buffer), which is a feature to delay building secondary indexes that are not unique, and merge writes. It’s further described by Facebook here. It’s not shown in the graph above, but it can boost insert performance by quite a lot, and it’s enabled by default. It was greatly improved in MySQL 5.5, so it is time to upgrade if you haven’t.

Partitioning may reduce the size of indexes, effectively reducing the table
into many smaller tables. It also reduces internal index->lockcontention, something that has been greatly improved in the MySQL 5.7.2 DMR.

Use innodb page compression. For some workloads (particularly those with lots of char/varchar/text data types) compression will allow the data to be more compact, stretching out that performance curve for longer. It may also allow you to more easily justify SSDs which are typically smaller in capacity. InnoDB page compression was improved a lot in MySQL 5.6, courtesy of Facebook providing a series of patches.

Sort and bulk load data into tables. Inserting in order will result in fewer page splits (which will perform worse on tables not in memory), and the bulk loading is not specifically related to the table size, but it will help reduce redo log pressure.

Remove any unnecessary indexes on the table, paying particular attention to UNIQUE indexes as these disable change buffering. Don’t use a UNIQUE index if you have no reason for that constraint; prefer a regular INDEX.

Related to the points 5 & 6, the type of primary key also matters. It is much better to use either an INT or BIGINT datatype than say a GUID, which will have a curve that degrades much faster. Having no PRIMARY KEY will also affect performance negatively.

If bulk loading a fresh table, delay creating any indexes besides the PRIMARY KEY. If you create them once all data is loaded, then InnoDB is able to apply a pre-sort and bulk load process which is both faster and results in typically more compact indexes. This optimization became true in MySQL 5.5.

More memory can actually help here too. I frequently see people under spec memory on new database servers compared to what it actually costs these days. Simple advice: If SHOW ENGINE INNODB STATUS shows any reads/s under BUFFER POOL AND MEMORY and the number of Free buffers (also under BUFFER POOL AND MEMORY) is zero, you could benefit from more (assuming you have sized innodb_buffer_pool_size correctly on your server. See here.)

As well as memory, SSDs can help too. Much of the performance drop shown on the curve can be attributed to additional IO which is created as the table gets bigger. While a hard drive can do 200 operations per second (IOPS), a typical SSD will do 20K+

We use Amazon RDS for database servers and historically have had a practice of having smaller database server in QA and DEV to save cost. I would consider this a pretty common practice – but what I have also discovered is that it can lead to fixing non-bugs, or issues that will not arrise in production.

Let me try and describe a (~) worst case and some simple math to illustrate:

Production is serving its 10K queries/second with a 100% cache hit rate, but with the 1G less RAM, QA is only able to serve with a 99% cache hit rate. Each query touches on average 10 distinct pages. So:

.. since we only have 100 IOPS, we can only serve 1/10th of the queries!

In our case it was more like production having 68G RAM and QA having 34G. The query volume itself was also not necessarily as high, but the queries all needed to access a very large number of distinct pages – causing the same cache miss, and page churn.

With databases, choices of algorithms influence performance characteristics significantly. It is often very easy to find a situation where one database will perform much worse than others, which is why you will hear database engine developers cry out that benchmarks are flawed.

The best benchmark is one that closely matches what your application actually does, which is why you see the TPC create benchmarks to match hypothetical situations – like a warehouse that has inventory arriving and being shipped out all the time. These situations in turn create “workloads” on the database. To give some context, a workload may perform differently on a different database engine because of how many concurrent modifications are required, the ratio of reads/writes, how much data is modified in a transaction, and where in the data set the reads and writes are (are there hot records or hot tables?). These are just examples – in reality there are too many factors to list.

Which gets me to my main point – what workloads is MySQL with the InnoDB storage engine really good at? In answering this, I really want to try to focus on core data structures and algorithms more than specific features that may or may not be implemented.

Total data size may exceed memory but working set does not

If you have for example 100G of data, you do not necessarily require 100G of RAM. MySQL will adjust and keep only the most frequently accessed data in memory (aka working set). This behavior is the result of using a B+Tree index. To simplify and compare this to say memcached where the index is a hash – it really requires as much memory as there is data.

(Note: Memcached of course addresses this by allowing you to distribute the index across many nodes, and hash indexes are generally smaller, so please don’t miss the key point I am making here.)

Modifications are small and transactions are short

When you modify a row in InnoDB, the old version of it is relocated to undo space, and the ‘new’ version is written in place (implementing MVCC). When all existing transactions have finished with the old version of the row, along comes a purge process.

Not all databases implement MVCC this way. For example another way of implementing MVCC is to just append/write new rows in the same table-space. The advantage of the relocation method is that if there is sufficient memory to keep the undo pages in memory (and they are not required for very long) is that this relocation may be a logical relocation only. i.e. physical IO relocation doesn’t have to happen. However, in the case that there is not enough memory/the undo pages need to be evicted from the buffer pool, then I could see this potentially taking more effort. I believe rolling back large transactions may be more effort with InnoDB’s design as well – although I have not put much thought into it.

Primary key lookups are fast

InnoDB implements a Clustered Index on the primary key. Clustering in this case refers to a close arrangement/organization on disk, not multiple compute instances working together. This makes primary key lookups in InnoDB particularly fast – although admitedly probably not as fast as databases that impliment a hash index.

Indexed ranged lookups are supported

Some index structures are suitable for fixed primary key lookups, or emulate other types of access by pre-computing a cache of values and packing them in one row/column. Because InnoDB uses a B+Tree index, it is very flexible in being able to support ranged lookups on primary or secondary keys (example: find rows BETWEEN x AND y). These indexes are also pre-sorted, which (depending on the query) can allow the rows to be efficiently delivered as they are read rather than buffered first.

Query performance is highly uniform

InnoDB is optimized to provide stable response times to user-serving applications, rather than just peak throughput in queries per second. There are a number of features which help deliver this: converting as much IO as possible to be in the background with the transaction log files, adaptive flushing to make sure that background work doesn’t fall behind, even the implimentation of MVCC I described above.

InnoDB also supports covering indexes, which means that for queries that do not SELECT * they sometimes have a chance to retrieve all data from the secondary index. For some cases where not all data fits in memory covering indexes can be super critical to uniformity of performance because the number of unique pages that will need to be looked at is considerably lower.

Table has only a primary key and non-unique secondary keys

So I mentioned already that total data-size can exceed main memory, but InnoDB also has a novel mechanism to delay INSERT/UPDATE/DELETE modifications to secondary indexes if the pages are not loaded into memory. What it will do is buffer the changes in a secondary structure, and then make the modification the next time the page is loaded into memory, or in the background provided the server is not under load. This feature only works when index is not unique, because otherwise it is not safe to make a modification not knowing if it violates a constraint. Ideally, the primary key also inserts data in order.

(In previous versions of InnoDB, this feature was described as the insert buffer).

..

What am I missing? Please try and focus on the algorithms not the features, and the positive not the negative!

A friend wrote to me recently with a question. He was working on a method to ship application metrics to statsd on a 1 minute interval. He had three examples of how to estimate the number of rows in a table and he wanted to know the difference between them.

I have actually never thought of using this method! I don’t think it’s accurate though, since data length has deleted space + additional preallocated or overhead space. For example a page file is only 15/16ths in InnoDB. So as you can see the number it returns is just over 10% higher than the actual number of rows (9441296).

This method takes a number that InnoDB provides, which in this case is accurate to within 4% of the actual number of rows (estimating slightly over in this case). So the question is really about how efficient a count already provided is, and where does it come from.

InnoDB samples a number of random pages in the table, and then estimates the total rows for the whole table. By default this sampling is just 8 pages in MySQL 5.5 (configurable via innodb_stats_sample_pages), but is greatly improved in MySQL 5.6 – with 20 pages sampled by default. The option is now called innodb_stats_persistent_sample_pages – a reference to the new persistent statistics feature!

So based on it being a fixed number of pages to examine, it is also going to scale reasonably with table growth. Pro tip: It is quite possible it may look much slower as soon as the table does not fit in memory, since 8 random pages could mean > 8 random IOs.

Select count(1)

This requires an index scan of the primary key. It’s important to explain why that is, since this behavior differs from MyISAM. InnoDB supports MVCC which is an important feature to allow concurrent access to rows without having to need readers set locks blocking other users from writing. In a practical sense what this feature means, is that at any one point in time there will be multiple versions of a row. The actual count(1) will depend on the time your transaction started, and its isolation level.

This solution will not scale well as the number of rows in the table grows and while storage-engine development is outside of my expertise, I suspect it is unlikely that this will be improved in any future MySQL versions. My reasoning is that I can not see an easy way of maintaining multiple different pre-computed counts without introducing any new global locking or overhead – which is a big no-no in being able to scale on multiple cores/cpus.

Finally, of the three solutions, this is the only 100% accurate method to be able to tell the exact number of rows in the table.