Evaluating IO subsystem performance for MySQL Needs

I’m often asked how one can evaluate IO subsystem (Hard drive RAID or SAN) performance for MySQL needs so I’ve decided to write some simple steps you can take to get a good feeling about it, it is not perfect but usually can tell you quite a lot of what you should expect from the system.

What I usually look for MySQL is performance in random reads and random writes. Sequential reads and writes are rarely the problem for OLTP workloads, so we will not look at them.

I also prefer to look at performance with O_DIRECT flag set to bypass OS cache. This may execute separate code path in kernel and so has a bit different performance pattern compared to buffered IO (even followed by fsync regularly) , but it allows to easily bypass OS cache both for reads and for writes and so does not require creating large working sets for boxes with significant amounts of memory (or reducing amount of usable memory).

The system I’m testing this on has 256MB BBU (Battery Backed up Cache) on RAID controller so we will test two workloads size – first one is small which fits in the cache and the second one will be large enough so it does not. This allows us to see both by RAID cache and uncached IO performance.

We’re interested in cached reads because they may show latency to the RAID cache if RAID read cache is enabled. If it is disabled you may be reading from Drives cache or even drives themselves which will affect performance significantly. In any case especially playing with data size a bit you will well learn how your cache behaves.

Were even more so interested about writes which show us how many cached writes per second we can do – this is important for log writes, which are synchronous IO operation database performance which typically have rather close data locality.

We’re also interested in uncached reads and writes because this correspond to general database workload.

It is worth to run the test with 1 thread and with some higher number (say 64) to see how things scale.

The tool I’m using for this is SysBench which was designed by my team when I still worked for MySQL and we specially implemented bunch of tests to ease hardware evaluation for things which are important for MySQL.

To prepare small 128MB single file working set we can use the following command:

./sysbench--test=fileio--max-time=60--max-requests=1000000--file-num=1--file-extra-flags=direct--file-fsync-freq=0--file-total-size=128M--file-test-mode=rndrd run

sysbench v0.4.8:multi-threaded system evaluation benchmark

Running the testwith following options:

Number of threads:1

Extra fileopen flags:16384

1files,128Mbeach

128Mbtotal filesize

Block size16Kb

Number of random requests forrandom IO:1000000

Read/Writeratio forcombined random IO test:1.50

Calling fsync()atthe endof test,Enabled.

Using synchronousI/Omode

Doing random readtest

Threads started!

Timelimit exceeded,exiting...

Done.

Operations performed:773835Read,0Write,0Other=773835Total

Read11.808GbWritten0bTotal transferred11.808Gb(201.52Mb/sec)

12897.22Requests/sec executed

Testexecution summary:

total time:60.0001s

total number of events:773835

total timetaken by event execution:59.0285

per-request statistics:

min:0.0001s

avg:0.0001s

max:0.0086s

approx.95percentile:0.0001s

Threads fairness:

events(avg/stddev):773835.0000/0.00

execution time(avg/stddev):59.0285/0.00

The output is pretty verbose so it is quite scary for many people, however there is basically one number important here 12897.22 req/sec – so the RAID cache on this drive can do about 13.000 16K reads/sec from the cache, not bad at all.

With 64 threads I get 23727.46 req/sec which shows even in such cached mode you can get better performance by having many outstanding requests.

Running with more than one threads you may also take a look at another table SysBench displays:

Shell

1

2

3

4

5

per-request statistics:

min:0.0008s

avg:0.0027s

max:0.0112s

approx.95percentile:0.0030s

So we get about 3ms 95 percentile request time – not bad – all request were executed with pretty uniform performance.

Let us now do the test with single thread and write-through RAID cache (as if there would not be any BBU)

Auch… We get 142.62 Requests/sec which is extremely poor considering this is 8 drives array. You would likely get a bit better with sequential log writes but it is not going to be much higher.

Lets see what we get with 64 threads… hm again 142.83 Requests/sec and looking at VMSTAT I can see only 1 blocked process all the time which does not sounds right.

Honestly I did not expect this result while running benchmarks for this articles (and this is always very exciting to find something unexpected while doing Benchmarks). Fortunately I’ve already seen this some time ago though I hoped this issue is long fixed since that time… Though this box is running CentOS4 which is not the newest OS out where.

The problem is basically in case of O_DIRECT writes there is only one IO can be happening per file at any given time. To show this is the problem let us create working set consisting out of number of files and repeat the run:

Now we get 827.94 Requests/sec requests per second, which is not bad considering this is RAID5 volume.

Let us now repeat the tests for WriteBack cache configuration:

We get 4735.27 Requests/sec from 1 thread and single file; 4740.19 Requests/sec from 64 threads and 1 file and 5228.68 Requests/sec with 64 threads and 128 files.

As you can see numbers are a lot better especially when there is single thread doing IO (just as with database log). We can see serialization with O_DIRECT however the impact is much less in this case compared to Write Through mode because we’re mostly limited by RAID controller capacity.

Let us now do test with larger test set. We’ll use 2GB in 128 files (because we already discovered synchronization issue with O_DIRECT). This sounds small but as controller cache is just 256M it should be good enough.

There is also one trick running such larger scale benchmarks. SysBench by default has repeatable sequence of requests, because it does not initialize random generator. If you’re doing multiple runs make sure to add –init-rng=1 option otherwise results can be skewed.

For single thread we get 259.28 Requests/sec which approximately matches what single drive can do. This is important lesson – even if you have many hard drives you can’t really get advantage of them unless you can get multiple outstanding requests.

Performance with 64 threads was 2334.63 Requests/sec which is perfect scalability to 6 hard drive this array has. We have even a bit more than 8x because 64 outstanding requests allow RAID controller to optimize seeks.

As you can see we got more writes for single thread than we had reads and also using multiple threads for writes did not improve performance as dramatically as for reads. This is because as we have BBU writes are just stored in controller cache and flushed to the disk in background – in this case there is little difference if we use one thread or multiple threads.

As you also can see BBU is helpful even for random writes if they are performed from single thread – so it does not only help database log writes but has much larger positive impact on workload.

I should also note it is very good idea to watch VMSTAT while doing the test. Here is the snipped from good VMSTAT from the read test:

Shell

1

2

3

4

5

6

0649056031964928683212631072003776023340269320126424

064905603196500868361263106800372222633287694017524

06490560319643686836126310680036998633407716017424

064905603196428868401263106400373951233207774017424

06490560319643686840126310640037498633657758017424

06490560319643686840126310640037702633707680027424

As you can see all 64 threads are waiting for IO. Now lest see our write test which seems to have some serialization issues

Shell

1

2

3

4

5

2399056053055568692010534204000934516254587007030

119056053055488692010534204000901716614934007426

119056053055568692410534200000922316384881007327

139056053055568693210534192000963417155033007426

119056053055568693210534192000923516435035007326

You can get similar information by analyzing iostat numbers and that well may be more accurate. Though looking at VMSTAT is often good enough to see the problem.

P.S If you’re wondering which hardware I used for benchmarks – it is Dell PowerEdge 2950 with PERC5 and 6 10K RPM hard drives in RAID5.

Related

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Battery Backed Caches are really a nice feature. Some time ago we were wondering why one of our test servers performed abysmally compared to the other few, even though they were supposed to be identical.
Only after some time we realized that that particular one had been sent without a battery pack installed…

Absolutely. I still being surprised how some people buy 10K+ servers with RAID with 10+ hard drives and do not buy BBU. It does not cost a lot these days and even if your load is read only you’ve got to load data or restore from backup sometime.

Does this mean that *not* using O_DIRECT could be the right way to go, contrary to popular opinion, if you want to have multiple I/O to a single file in flight? With InnoDB, even with file_per_table, I would think you’d want multiple I/O to the iblog* files and probably ibdata*, not to mention busy tables.

As you can see if there is BBU the serialization with O_DIRECT does not hurt that much for large working set and small one mainly correspond to log writes which are done serially anyway.

Log files are only written to the end anyway so large number of them does not help.

If you do not have BBU when O_DIRECT already can be slowing your down because Innodb performs writes in background basically flushing pages one after another.

It would be interesting to check it with RHEL5 if serialization is still in place. I thought it is already long gone… I reported it last time to kernel Developers and they told me it will be fixed. It was long ago.

The IO scheduler matters for this test as well, even when using O_DIRECT-deadline should win. What is your IO scheduler setting? Most real-world cases the Cache on the controller is not a win and is really just used for making sure outstanding writes are not lost on crash. How did you fully remove filecache from the equation to make sure that the 256 MB test is not skewed by the system cache?

So, why choose RAID-5 for this test, when RAID-10 is perfect for random IO? For testing disks set-up to get theoretical throughput, have you tried RAW device writing?

What’s your method for achieving theoretical throughput to verify your test is testing the RAID-setup itself and not the various OS Layers in front of the disks?

First I should say the point of this article was not to show the best configuration out where but instead to show how you can measure performance and what numbers you should look at.

I just used the box which I had available for test which happened to have this OS setup or RAID level.

This box used cfq scheduler which works pretty well for workloads like this.

Honestly I’m surprised by the claim cache on controller is not a win. What particular workload you’re speaking about and what storage engine ? For MyISAM it well may be the case because it does not use synchronous IO but even in this case it is going to help the journaling file system you’re likely to use.

I am also not sure what do you mean by outstanding writes are not lost on the crash. Which outsanding writes ? To the OS device, to the RAID controller or to the disk drive ? Buffered or Unbuffered.

If you have buffered writes which are going to OS cache they can be lost on power down anyway. If you’re speaking about unbuffered IO to device (O_SYNC, O_DIRECT) they must not be lost after call returns. If they are you have configured something incorrectly (ie forced Writeback cache mode on the RAID without BBU) and this is exactly what should be avoided as loss in this case will corrupt Innodb database and even with MyISAM you’re risking journaling file system corruption.

File system cache is removed from equation because of use of O_DIRECT which bypasses OS cache for reads and writes. 256MB cache on controller can still potentially skew read results for 2GB tests if read cache is enabled on controller but this is not that large.

Regarding theoretical number I usually just look at the specs and see how close we get. Even if you test raw RAID partition you’re not testing drives you’re testing RAID controller with its drivers anyway. This is why I prefer to test the whole stack and see how it makes sense.

Honestly Iâ€™m surprised by the claim cache on controller is not a win. What particular workload youâ€™re speaking about and what storage engine ?

INNODB with 200 GB of data.

The memory on the controller helps, but I’m not going to get enough read or write performance from it where it helps end query performance of a mysql database. On a six disk RAID-10 15K RPM system, 22 MB/sec of random reads in a mixed read/write setup is just as good as it gets, the biggest win I receive from the battery backed cache is knowing that the last IO-OP from the controller made it to the disk, as well as flushes are less expensive. So, having a 256 MB controller in not going to reduce Response time for an application when doing filesorts or temp tables – which many people think.

Also just for fun here is a good link on what Linus Torvalds thinks of O_DIRECT.

First 200GB Innodb does not tell me much about workload – how many read writes and also how did you figure out what there is no gain from having BBU ?

In your case it is well possible Reads are what limits you in this case you would not get much benefit from BBU.

The benefits from BBU comes in number of ways – your Innodb log writes will be faster (of course if you do not care much about transactions and running with innodb_flush_logs_at_trx_commit=0 it does not apply much) and it will make flushes from buffer pool faster and also give RAID more room to optimize requests so writes would impact reads.

Temporary tables and Sort files are absolutely unrelated to BBU because these use MyISAM and so buffered IO so writes to them can be cached (and delayed) by operating system. Though good way to optimize them to keep them on tmpfs unless you have too large sorts to be able to do so.

Regarding O_DIRECT and Linus I know, so what ? The point is what works right now not what in theory would be best solution.

Linus feels PageCache should be used to cache data instead of Innodb Buffer Pool, well this first would require to use mmap IO because you can’t pay system call penalty for access to cached data, which makes it quite tricky to handle IO failure operations. But what is even more DataBases need very strict control about flush order which PageCache does not provide.

So the way which Linus thinks is a good way to go is not supported by MySQL/Innodb right now and neither can be in the near future.

I run sysbench on our new hardware before and did not notice much difference between cfq and deadline schedulers. Here is an article from redhat’s website . http://www.redhat.com/magazine/008jun05/features/schedulers/ And it is probably why cfq is RHEL’s default scheduler now.

We had a couple of DB servers’ disk controller battery dead and they started to crawl. So BBU is a must-have for DB servers.

I tested this on a raid 10, serverraid 8i with 256M and got exactly 12897.22 Requests/sec executed on some runs and not others on the read test. It seems wierd to get the exact number. It’s intresting to see the controler diffrence with the various cache’s on and off, I assumed read cache off would be better for mysql?

@ Peter: I totally agree of the benefits of BBU-I am agreeing with you, I’m just injecting that having 256MB of cache on a controller is not going to give any more throughput (T) and it will not solve problems that schema design will.

I’m doing around 200K selects (various large ranges, primary key lookups, secondary key lookups) per second across a few mysql servers, with 70% read / 30% write ratio – without using O_DIRECT (can’t because its broken in my kernel version). The BBU helps, but to attain such speed the RAID setup and schema-use is where I am able to attain 22MB a sec of random read I/O and 10 MB a sec of random write I/O on 3 effective spindles for writes and 6 for reads-with less then 10% of the CPU in WIO.

Recently the cache was turned off and there was a hit on performance but that was related to flushing of IOPS so the WIO went up increasing load – yet I was still able to sustain 22MB a sec of random I/O. This is the behavior I noticed (which was about a year ago)

So I agree that BBU is needed for better responsiveness R(t), but it’s not directly but indirectly associated with throughput (T).

I would love to see the same test with different cache profiles that the PERC-5 offers. Also a test with the cache turned off to see the difference.

[…] Evaluating IO subsystem performance for MySQL Needs | MySQL Performance Blog ‘m often asked how one can evaluate IO subsystem (Hard drive RAID or SAN) performance for MySQL needs so I’ve decided to write some simple steps you can take to get a good feeling about it, it is not perfect but usually can tell you quite a lot of what yo (tags: benchmark database disk mysql performance storage sysadmin io) […]

My experience is that with a BBU on a RAID unit (whether controller or disk stack), the cfq elevator comes with a ~20% performance penalty. Why? Because unless you’re exposing the spindles to the OS one-by-one (or in RAID-1 pairs striped with MD/LVM), you’re having cfq order all the random reads/writes to the disk within one I/O queue, practically guaranteeing only a maximum of two, perhaps three of your n spindles is performing any useful work.

The noop scheduler, on the other hand, lets the requests go to the RAID (and BBU cache in the case of writes) in random order, letting it utilize all spindles to their best effect.

I’m glad you asked, Peter. Since the explanation is on the longish side, I just posted it on my blog at http://www.fishpool.org/post/2008/03/31/Optimizing-Linux-I/O-on-hardware-RAID. Would love to hear your comments.

Very very interesting post! It help me a lot to find what cause a High IOwait problem, resulting poor performance. I was not familliar with SysBench and now I use it on every servers I manage. I love it.

I use RAID 10 with no BBU. Intense writing is slower on it than my old server with a single Disk, no raid. The performance also decreases when I tried O_DIRECT to flush the log so I switched back to the default. I didn’t try with BBU, but if I can’t get all the thing faster, i will try (and buy a battery).

Writes can be slower with O_DIRECT if you do not have BBU. However it is often not that bad as log write will require single IO anyway and the optimization for dirty page flushes may not affect subjective performance because it is happening in background.