MyRocks matches or beats InnoDB performance with much better efficiency

Configuration

I used my Linkbench repo and helper scripts to run Linkbench with maxid1=1B, loaders=4 and requesters=16, so there were 4+1 concurrent connections doing the load (4 for the link/count tables, 1 for the node table) and 16 connections running transactions after the load finished. My Linkbench repo has a recent commit that changes the Linkbench workload; results with that change are labeled new config and results without it are labeled old config. I share both. The test pattern is 1) load and 2) transactions. The transactions were run in 12 1-hour loops and I share results from the last hour. The test server has 48 HW threads, fast SSD and 50gb of RAM available to the OS page cache and database.

Tests were run for MyRocks, InnoDB from upstream MySQL and TokuDB. The binlog was enabled but sync on commit was disabled for the binlog and database log. All engines used jemalloc. Mostly accurate my.cnf files are here.

MyRocks was compiled on August 15 with git hash 0d76ae. The block cache was set to 10gb. Tests were done without compression, and then with compression: none for L0/L1/L2, LZ4 from L3 through the next-to-last level, and Zstandard for the max level.

InnoDB was from upstream 5.6.35, 5.7.17 and 8.0.2. The buffer pool was set to 35gb. The performance schema was enabled. Compression was not used.

TokuDB was from Percona Server 5.7.17. The db cache was set to 10gb. Tests were done without compression and then with zlib compression.

Load Results

All of the data is here and note that I did not adjust metrics computed from iostat to account for iostat overstating bytes written by up to 2X for RocksDB. The table below has a subset of the results for the new config.

MyRocks has the best insert rate and compression doesn't hurt the rate

There is a regression for InnoDB from 5.7 to 8.0.2

Write efficiency (wKB/i) is best for TokuDB and similar between MyRocks and InnoDB

CPU efficiency (Mcpu/i) is similar for MyRocks and InnoDB

Space efficiency (size) is best for MyRocks and TokuDB. Uncompressed MyRocks uses much less space than uncompressed InnoDB

Transaction Results

These are results from the 12th 1-hour loop of the transaction phase. All of the data is here and note that I did not adjust metrics computed from iostat to account for iostat overstating bytes written by up to 2X for RocksDB. The table below has a subset of the results for the new config. I will explain them.

Friday, September 8, 2017

After a few weeks of sysbench it is now time for more complex workloads and the first one is Linkbench with a cached database and low-concurrency. I prefer to start with cached & low-concurrency configurations before trying IO-bound & high-concurrency.

tl;dr:

InnoDB from MySQL 5.6 had the best throughput

CPU efficiency is similar for MyRocks and InnoDB

There is a CPU regression from MySQL 5.6 to 5.7 to 8.x

Write efficiency was similar for all engines on the load test but much better for MyRocks and TokuDB on the transaction test.

Configuration

I used my Linkbench repo and helper scripts to run Linkbench with maxid1=10M, loaders=1 and requesters=1, so there were 2 concurrent connections doing the load and 1 connection running transactions after the load finished. My Linkbench repo has a recent commit that changes the Linkbench workload; results with that change are labeled new config and results without it are labeled old config. I share both. The test pattern is 1) load and 2) transactions. The transactions were run in 12 1-hour loops and I share results from the last hour. The test server has 48 HW threads, fast SSD and 256gb of RAM.

Tests were run for MyRocks, InnoDB from upstream MySQL and TokuDB. The binlog was enabled but sync on commit was disabled for the binlog and database log. All engines used jemalloc. Mostly accurate my.cnf files are here but the database cache was made large enough to cache the ~10gb database.

MyRocks was compiled on August 15 with git hash 0d76ae. Compression was not used.

InnoDB was from upstream 5.6.35, 5.7.17 and 8.0.2. The performance schema was enabled.

TokuDB was from Percona Server 5.7.17. Compression was not used.

Load Results

All of the data is here and note that I did not adjust metrics computed from iostat to account for iostat overstating bytes written by up to 2X for RocksDB. The table below has a subset of the results for the new config. The results for the old config are similar.

InnoDB 5.6 has the best insert rate but there is a regression from 5.6 to 5.7 to 8.0.2

Write efficiency (wkb/i) is similar for all engines

CPU efficiency (Mcpu/i) is similar for MyRocks and InnoDB

ips     wkb/i   Mcpu/i  size    wMB/s   cpu     engine
54283   1.60    83      14      86.7    4.5     myrocks
64402   1.02    72      16      65.6    4.6     inno5635
56414   1.03    77      16      58.2    4.3     inno5717
42954   1.02    97      16      45.1    4.2     inno802
21611   1.42    179     14      30.7    3.9     toku5717

legend:

* ips - inserts/second

* wkb/i - iostat KB written per insert

* Mcpu/i - normalized CPU time per insert

* size - database size in GB at test end

* wMB/s - iostat write MB/s, average

* cpu - average value of vmstat us + sy columns
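The write-efficiency column can be cross-checked from the throughput and write-rate columns. A minimal sketch of that derivation, assuming iostat reports decimal MB (1 MB = 1000 KB); values match the myrocks, inno5635 and toku5717 rows above to two decimals:

```python
# Sanity-check wkb/i (iostat KB written per insert) from the table's
# wMB/s and ips columns. Assumes decimal units: 1 MB = 1000 KB.
def wkb_per_insert(w_mb_s, ips):
    return w_mb_s * 1000.0 / ips

rows = {  # engine: (ips, wMB/s) copied from the table above
    "myrocks":  (54283, 86.7),
    "inno5635": (64402, 65.6),
    "toku5717": (21611, 30.7),
}
for engine, (ips, wmbs) in rows.items():
    # e.g. myrocks: 86.7 * 1000 / 54283 ~= 1.60 KB written per insert
    print(engine, round(wkb_per_insert(wmbs, ips), 2))
```

Note that this is an average over the whole load, so rows measured over slightly different windows may not reproduce exactly.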

Transaction Results

These are results from the 12th 1-hour loop of the transaction phase. All of the data is here and note that I did not adjust metrics computed from iostat to account for iostat overstating bytes written by up to 2X for RocksDB. The table below has a subset of the results for the new config. I will explain them. The results for the old config are similar.

InnoDB 5.6 has the best transaction rate but there is a regression from 5.6 to 5.7 to 8.0.2

Write efficiency (wkb/t) is much better for TokuDB and MyRocks than for InnoDB

Thursday, September 7, 2017

I used sysbench to compare IO-bound table scan performance for MyRocks, InnoDB and TokuDB. Tests were run on a large server with fast SSD, Intel NUCs with SSD and an Intel NUC with disk. I call this IO-bound because for all tests the table was larger than RAM.

tl;dr

MyRocks can be more than 2X slower than InnoDB.

InnoDB in 5.7 does better than in 5.6

TokuDB without compression is comparable to InnoDB without compression and does much better than InnoDB when TokuDB prefetching is enabled.

Compression usually has a small impact on scan performance for MyRocks with zstd and a much larger impact for TokuDB with zlib. I wonder how much of this is a measure of zstd vs zlib.

Scans were usually slower for all engines after fragmentation but the impact was larger for MyRocks and TokuDB than for InnoDB.

Configuration

I used my sysbench helper scripts with my sysbench branch. For tests with X tables there was 1 connection per table doing a full scan and when X > 1 the scans were concurrent. The scan time was measured twice -- first immediately after the load and index step, and then after many updates had been applied. The second measurement was done to show the impact of fragmentation on scan performance.

I repeated tests on different hardware:

48core.ssd - server has 48 HW threads, fast SSD and 50gb of RAM. Tests were done with 8 tables and 100M rows/table and then 1 table with 800M rows.

I compiled MyRocks on August 15 with git hash 0d76ae. The MyRocks tests were done without compression and with zstd compression (myrocks.none, myrocks.zstd). I did one test for MyRocks with a binary that did not use special instructions to make crc32 faster (myrocks.none.slowcrc) and learned that fast crc doesn't make a difference on this test. It would be a bigger deal for an IO-bound test doing point queries.

I used TokuDB from Percona Server 5.7.17. The TokuDB tests were done without compression and with zlib compression. I tried tokudb_disable_prefetching ON and OFF (toku5717.none, toku5717.none.prefetch), but I have been setting this to ON for my OLTP benchmarks because enabling it ruined some OLTP results.

I used InnoDB from upstream 5.6.35 and 5.7.17. The performance_schema was enabled. The InnoDB tests did not use compression.

Results
The results below list the number of seconds to scan the table(s) and the time relative to InnoDB from MySQL 5.6.35. For the relative time a value greater than 1 means the engine is slower than InnoDB. These values are reported for pre and post where pre is the measurement taken immediately after loading the table and creating the secondary index and post is the measurement taken after applying random updates to the table(s).
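The relative-time computation described above is just a ratio against the InnoDB 5.6.35 baseline. A minimal sketch, using made-up scan times rather than numbers from the result tables:

```python
# "Time relative to InnoDB 5.6.35": engine scan seconds divided by the
# InnoDB 5.6.35 scan seconds, so a value greater than 1 means the
# engine is slower. The numbers below are hypothetical.
def relative_time(engine_secs, inno5635_secs):
    return engine_secs / inno5635_secs

print(relative_time(230.0, 100.0))  # 2.3 -> more than 2X slower
print(relative_time(95.0, 100.0))   # 0.95 -> slightly faster
```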

See the tl;dr above for what I learned from these results.

Large server

These are results from 8 tables with 100M rows/table and then 1 table with 800M rows on the large server.

Wednesday, September 6, 2017

This post compares MyRocks, InnoDB and TokuDB on the time required to load and index a table for sysbench.

tl;dr

MyRocks, InnoDB and TokuDB have similar load performance although there is a regression for InnoDB from 5.6 to 5.7 to 8.x

InnoDB create index is much faster starting in 5.7

Configuration

I used my sysbench helper scripts with my sysbench branch and configured it to create 1 table with 800M rows. The binlog was enabled but sync on commit was disabled for the binlog and database log. The sysbench client shared the host with mysqld. The host has 48 HW threads, 50gb of RAM for the OS and MySQL and fast SSD. The test table is larger than RAM but it will take me a few days to get details on that. The test was repeated for MyRocks, InnoDB and TokuDB. I continue to use the IO-bound setup as described previously.

I compiled MyRocks on August 15 with git hash 0d76ae. The MyRocks tests were done without compression and with zstd compression.

I used TokuDB from Percona Server 5.7.17. The TokuDB tests were done without compression and with zlib compression. I tried tokudb_disable_prefetching ON and OFF, but I have been setting this to ON for my OLTP benchmarks.

I used InnoDB from upstream 5.6.35, 5.7.17, 8.0.1 and 8.0.2. For 8.x I used latin1/latin1_swedish_ci charset/collation. The performance_schema was enabled. The InnoDB tests did not use compression.

Results
The load is in PK order and there are no secondary indexes. Engines have similar performance, although there is a gradual regression for InnoDB with each new release and a big regression from 8.0.1 to 8.0.2 which I hope will be fixed as 8.x approaches GA. The ratio is the time to load for the engine divided by the time to load for InnoDB from MySQL 5.6.35.

Create index performance has more diversity. The table is larger than RAM, some of it will be read from storage and engines with compression (MyRocks.zstd, toku5717.zlib) suffer from decompression latency. The prefetch option doesn't help TokuDB in this case. InnoDB create index performance got much faster starting in 5.7. The ratio is the create index time for the engine divided by the time for InnoDB from MySQL 5.6.35.

MyRocks is based on RocksDB and RocksDB is write-optimized, so why don't write-heavy workloads always run faster on MyRocks than on InnoDB? I will start with a summary of MyRocks versus InnoDB: MyRocks has better space & write efficiency, frequently has better write latency and sometimes has better read latency. Better space & write efficiency means you use less SSD and it will last longer.

Don't forget that better write and space efficiency with MyRocks can lead to better read latency. When a database engine does fewer writes to storage there is more IO capacity available for reads which is a big deal with disk and can be a big deal with slower SSD. When a database engine uses less space in storage then it is likely to cache more data in RAM and have better cache hit rates on reads. Many of the benchmarks that I run use uniform distribution for key generation and won't show the benefit from better cache hit rates.
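The "fewer writes leave more IO capacity for reads" point can be sketched as simple budget arithmetic. All of the numbers below are hypothetical, chosen only to illustrate the shape of the tradeoff:

```python
# With a fixed device throughput budget, read capacity is whatever the
# write load doesn't consume. Hypothetical numbers throughout.
device_mb_s = 500.0   # hypothetical sustained device throughput
txn_rate = 10000.0    # hypothetical transactions/second

def read_budget(write_kb_per_txn):
    write_mb_s = write_kb_per_txn * txn_rate / 1000.0
    return device_mb_s - write_mb_s

print(read_budget(4.0))  # b-tree-like write rate -> 460.0 MB/s left for reads
print(read_budget(1.0))  # LSM-like write rate    -> 490.0 MB/s left for reads
```

The gap matters most when the device is near saturation, which is why the benefit is a big deal with disk and slower SSD.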

With RocksDB a write usually means a call to Put, and Put is a blind-write. As shown by many benchmark results, a blind-write can be very fast with RocksDB -- insert data into the memtable, optionally flush the WAL to the OS page cache, optionally force the WAL to persistent storage. But SQL update and insert statements usually need much more than a blind-write, and the reads done by MyRocks can explain why some write-heavy workloads are faster with InnoDB. Things that get in the way include:

Pluggable storage engine APIs are slow to adopt blind-write optimizations. I don't blame MySQL and MongoDB for this because such optimizations are limited to write-optimized engines. But MyRocks and MongoRocks are here and I expect that WiredTiger/MongoDB will eventually have an LSM in production. M*Rocks engines can use the merge operator for this. I don't expect blind-write optimizations to ever be implemented for an update-in-place b-tree.

The modified row count must be returned for an update statement and that requires evaluation of the WHERE clause. For RocksDB this requires reads -- from the memtable, and maybe from the LSM tree, OS page cache and storage. Reads from storage and the OS page cache might require decompression. This is a lot more work than a blind-write. Using blind-write optimizations will result in statements that are update-like and insert-like, because application programmers must be aware of the semantic differences -- the modified row count won't be returned and constraint violations won't be acknowledged. I think it is worth doing.

Unique constraints must be enforced for the PK and unique secondary indexes. With the exception of an auto-increment column, this requires a read from the index to confirm the value does not exist. In the best case updates & inserts are in key order and the structures to be searched are cached in RAM -- otherwise this requires reads from the OS page cache and/or storage and might require decompression.

Secondary indexes must be maintained as part of the update & insert statement processing. For an insert this means that a new index entry will be inserted. For an update that requires index maintenance the existing index entry will be deleted and a new one will be inserted. With a b-tree the leaf page(s) will be read (possibly from storage), made dirty in the buffer pool and eventually those pages will be written back to storage. With RocksDB non-unique secondary index maintenance is read-free and blind-writes are done to put the new index entry and possibly delete-mark the old index entry. See the previous paragraph if there is a unique secondary index.

Tuesday, September 5, 2017

MyRocks does worse than InnoDB and the difference is larger on read-heavy tests.

InnoDB QPS tends to decrease after 5.6.35

InnoDB range scans are more efficient after 5.6.35

Configuration

I used core i5 Intel NUC servers with an SSD, my sysbench helper scripts and sysbench branch. The test was run with 4 tables and 1M rows/table. Somewhat accurate my.cnf files are here. The test was run for MyRocks compiled on August 15 with git hash 0d76ae and InnoDB from upstream 5.6.35, 5.7.17 and 8.0.2. The binlog was enabled but sync on commit was disabled for the binlog and database log. InnoDB 8.0.2 used latin1 charset and latin1_swedish_ci collation to match earlier releases. The sysbench client and mysqld shared the host.