For modern database systems the main bottleneck is usually IO. To speed up our database we conducted a study to find the fastest IO system currently available. In this article we describe our approach and findings. The result: with our new database server filled with SSDs we can transfer the equivalent of nearly one DVD per second: 3.3 GB/s.

Our current SAS hard-disk based database server (6 disks, RAID10) had an access-time bottleneck: too much concurrent database access caused the server to slow down considerably. In a previous article, building-the-new-battleship-mtron, we promised test results using our new hardware.

We can now finally show you some of the test results and give you a taste of what SSD-based file systems will bring. Of course, we ran into several firmware incompatibility issues with this emerging hardware, found motherboard PCI-E bottlenecks, discovered Linux kernel performance differences, software-specific benchmarking problems and strange file system hiccups, but most of this has been circumvented to produce the figures listed in this article.

Benchmarking was not a goal in itself. Given the amount of hardware (which reflects a certain amount of hard cash), what would be the best performing, cost-effective and fail-safe setup using this hardware?

Well, we think we figured that out. Where to start… first let us explain why we chose this hardware setup in the first place. And for those who want to jump to conclusions, be our guest.

SSD: they come in several flavors, where the best taste is the SSD built using SLC memory; these last longer and generally perform better on writes. The downside: they are much more expensive. We already had some hands-on experience with MTRON SSDs in the past, so we chose the newer, better and faster MTRON 7500 Pro series. Of course there are Intel, MemoRight, OCZ and Transcend too, but at the time of writing MTRON was the best performer on paper. We ordered 12 MTRON MSP7535-032 (32GB) and 4 extra MTRON MSP7535-064 (64GB) SSDs.

RAID controller: that should be a PCI-E x8 board to handle all the IO. Unfortunately, at the time of writing no PCI-E x16 boards were available (and not many, if any, server motherboards support PCI-E x16 slots). Cache and even more cache should be available (only useful with a Battery Backup Unit applied). And, this may sound odd, but specifically for us the controller also needed to support SAS disks: we wanted to re-deploy some of our SAS hard-disks for good old times' sake. A logical choice was the ARC1680IX-12. Not only had we already gained some experience on a test server with its smaller brother, the ARC1680IX-8, but the recent firmware release 1.45 put it nearly on par with the native SATA controllers of the same brand. Problematic for us is the fact that all current high-performance RAID controllers have been designed with traditional hard disks in mind. This means that the RAID controller often becomes the bottleneck. To overcome this bottleneck we used not one, but two ARC1680IX-12s in parallel. The setup we used is depicted in figure A.

fig A. RAID05 dual controller setup

Server: the server motherboard selected is the Super Micro X7DWN+, which can hold up to 128GB of RAM and two Quad Core Xeon CPUs. We equipped the board with two X5460 CPUs (at that time the highest-clocked Xeon, which comes in handy for CPU-bound calculations) and a mere 48GB of RAM. Since the server will act as a database server, the more RAM the better: most of the data can then be held in memory.

Casing: a couple of OS disks, legacy SAS hard-disks and 12 SSDs need a shelter. We chose the Super Micro SC846, a 4U server casing that nicely matches our two 1680IX-12 controllers and comes with a redundant power supply, something which is truly enterprise-worthy.

The first thing we did was install the controller cards, wire the backplane, insert an ordinary hard disk for the OS, install Debian on it and insert all SSDs in random order into the cabinet. Then the reboot, and… show time!

No, not yet: the event log of the Areca controllers showed many Time-Out errors. Doing the regular stuff like building RAID arrays seemed impossible, let alone creating a file system on the newly created block devices.

Research revealed that the hardware came with different firmware than we had used before. We found out that the 64GB SSDs were shipped with firmware 0.18R1H3, the same we used in our test server, while the 32GB SSDs came with 0.19R1, which was new to us. The drives with firmware 0.18R1H3 did indeed work properly. So we first asked our supplier to ship us the newer firmware 0.19R1H2 available at that time (sometimes newer is better). We flashed the disks to 0.19R1H2, but still no good: a lot of Time-Out errors appeared in the event log again. Then we figured that sometimes older is better and asked MTRON if it was possible to downgrade the firmware; luckily it was. We were provided with the DOWN.EXE executable and the proper MTRON.MFI, and we flashed the SSDs back to 0.18R1H3. Bingo… all seemed to work fine now. We recently tried to upgrade some SSDs to firmware 0.20R1, but that didn't work either.

Not only does MTRON release firmware on a regular basis, but so does Areca. We therefore upgraded the ARC1680IX-12s to the 1.46 firmware released on 23-01-2009 and performed the whole ritual again, flashing the SSDs to the latest version 0.20R1. At first glance no Time-Out errors were noticed after the flash; however, building RAID arrays slowed down nearly ten times and a file system created on the resulting block device performed over 20 times worse than expected. We are back to 0.18R1H3 again. Currently Areca and MTRON are investigating in a joint effort what causes these problems. Some hurdles have been taken, but the investigation isn't finished yet; we will keep you posted. The latest news is that Areca was able to reproduce the problems, but located them within Intel's IOP348 firmware. Of course Areca is unable to fix anything there, so it's now Intel's turn.

The motherboard comes with plenty of PCI-E x8 slots, and we installed the RAID controllers in the slots that would give the best airflow in the casing. Since we have two controllers, we created identical arrays on both controllers (let's call them “legs”). Tests showed a read performance difference of some 25% between those legs. Swapping the controller cards, and even the SSDs, showed that the cards were performing okay and the SSDs were not to blame either. Looking in the manual, we discovered that the PCI-E x8 slots we tried were driven by different chipsets. Our board comes with an Intel 631xESB/632xESB IO controller chip (south-bridge) and the 5400MCH (north-bridge), where the 631xESB/632xESB IO controller chip performs less well. If you are putting any piece of hardware in your server, be aware of, or better, test which slot to take.

Put in figures: using the 5400MCH we were able to do a multi-threaded sequential and random READ with a maximum total bandwidth of 1750MB/s, while the 631xESB/632xESB IO controller stopped at 1400MB/s. WRITING was not affected by different slots, since it is bottlenecked by the disks themselves.

Which benchmark software to use may be considered one of the most important aspects of deriving figures from a server. Ideally that would be software for which a lot of results are already available for comparison. We evaluated several, like bonnie++, IOMeter, IOZone, bm-flash and Orion, all with their own pros and cons.

We eventually settled on bm-flash and IOMeter. bm-flash takes a fixed amount of time to finish and gives a quick impression of file system performance, while IOMeter with a default workload is widely used, so we have an abundance of figures for comparison. We dropped bonnie++ because under some conditions it will only show “++++” when no useful figures could be calculated. IOZone and Orion required a complete study to be interpreted.

In order to use IOMeter we had to drop Linux and install Windows Server 2008. Dynamo (the test client) running on Linux and the IOMeter GUI (manager) running on Windows are somehow incompatible and continuously caused crashes on our test server. For our escape to Windows Server 2008 we were stuck with NTFS.

For benchmarking Linux we chose the default JFS file system because previous tests showed that (in our case) it used the least CPU overhead and performed better than ext3. One may get totally lost in file system tuning, but we decided it was not doable to tune too many parameters at a time. Note that the mount options were set to cause a minimum of writes to the disks:
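The exact options are not preserved in this post; as an illustration, a typical fstab entry that minimizes write traffic (the noatime and nodiratime options suppress the access-time updates that would otherwise cause a write on every read) could look like this, with a hypothetical device and mount point:

```shell
# /etc/fstab — illustrative entry, device and mount point are hypothetical
/dev/vg0/stripe  /data  jfs  noatime,nodiratime  0  2
```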

In order to increase the bandwidth we wanted to make use of software striping, with the side effect of doubling the amount of RAM cache. We created two identical legs on the controllers, plugged them into equally performing lanes and installed the mdadm and lvm2 packages for Debian (Debian Lenny – Linux 2.6.26-1-amd64). For both mdadm and lvm2 we configured striped block devices. Let the games begin…
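As a sketch of both variants: the device names (/dev/sda and /dev/sdb standing in for the two hardware RAID block devices) and the 64kB chunk/stripe size below are illustrative, not necessarily what we used:

```shell
# mdadm variant: software RAID0 across the two controller legs
mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=64 /dev/sda /dev/sdb

# lvm2 variant: a striped logical volume across the same two devices
pvcreate /dev/sda /dev/sdb
vgcreate vg0 /dev/sda /dev/sdb
lvcreate --stripes 2 --stripesize 64 --extents 100%FREE --name stripe vg0
```

Both produce a single block device (md0 or /dev/vg0/stripe) on which the file system is then created.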

Without bothering you with too many figures right now, we discovered an interesting finding: block devices created by mdadm performed about 10% better than lvm2 ones. Write bandwidth is quite similar, but the number of IOPS and the read bandwidth using mdadm are higher. A survey on the internet showed that more people complained about this performance difference, and that file system alignment and read-ahead settings should be properly set. But even when all is properly configured the difference remains. E.g. we can do a maximum of 65000 IOPS using mdadm and only 59000 IOPS with lvm2; we can read (sequential, single-threaded) a maximum of 1050MB/s using mdadm and only 980MB/s with lvm2. These measurements were obtained with reasonably fresh/new SSDs. Later on performance degraded and there was seemingly no way to get the original performance back. The mdadm/lvm2 performance difference can be attributed to the Linux kernel: mdadm uses the md kernel module while lvm2 uses the dm kernel module, and these are beyond our control. Because of system administrator support we will not use mdadm but will go with lvm2.

Note: we recently retested on Debian Lenny, now stable. The max read bandwidth (sequential, single-threaded) for both mdadm and lvm2 stopped at 990MB/s. The difference in IOPS remained (65000 mdadm vs. 60000 lvm2) but was only noticeable for small block-sizes (512B – 1kB) and is therefore of minor importance to us.

Notice the drop in performance that starts when writing 64kB blocks.

Tests with IOMeter showed that under certain conditions the whole controller became unresponsive for several seconds; we even measured file system freezes in the order of 10 seconds (both reading and writing). We are still puzzling over what causes these freezes, but we assume that after heavy writing, with the controller's cache completely filled up, the controller first flushes some cached data to disk and only then starts working properly again. We asked Areca what the intended behavior for this phenomenon should be. We will keep you posted.

Those interested in hard figures are probably looking at the graphs already. We deliberately will not show all the test results; we graphed them where possible to give an overview of SSD performance in different RAID configurations.

In short: we have 16 SSDs and two controllers that support a maximum of 12 SSDs each. We tested RAID0 and RAID5, but skipped RAID6 and RAID10 arrays since their write performance was rather poor. We repeatedly tested 1 to 12 SSDs in a RAID0 and 3 to 12 SSDs in a RAID5 configuration. For lvm2 striped block devices (using the two controllers with software RAID0) we repeatedly tested 2×4 to 2×8 SSDs per RAID0 and RAID5 configuration.

Emphasized are the 8kB block-size figures, because our database implementation is compiled to work with an 8kB block-size.

Sequential read, write and file copy versus number of SSDs and RAID configuration

Conclusion: bandwidth seems to be limited by the number of IO requests that can be spawned by a single thread (process). Scaling up the number of threads from 10 to 40 hardly influences the total bandwidth. Using lvm2 software RAID0 incurs a slight performance penalty on random reads.

As we can see in figures 18, 20 and 22, starting from a 256kB block-size we finally reach our claimed 3.3 GB/s. Although reading with 40 threads generally didn't influence the results much, in this case we already reached the mentioned 3.3 GB/s consistently at a smaller block-size (128kB) using 40 threads. In figure 22 we show the results obtained with 16 SSDs over 2 RAID controllers. However, the exact same read performance was also observed when using 8, 10, 12 and 14 SSDs over 2 RAID controllers, meaning read performance saturates quite rapidly and increasing the number of SSDs beyond a certain threshold doesn't further improve read performance.

Figures 24 to 29 clearly show the awkward ‘gaps’ that we encountered multiple times during testing. The gaps shown here are relatively minor (we saw much worse ones). As of now we have no explanation for this, and attribute it to compatibility problems between the MTRON SATA disks and the Areca SATA implementation (via the Intel IOP348 SATA stack).

RAID 5 – closer look at write performance

fig 30. random write-only bandwidth in [MB/s] vs. block-size and number of disks

Figure 30 zooms in on the write performance we measured for several RAID 5 configurations, using 10 threads. The general trend is clearly that more disks improve write performance, a trend that was not so clearly visible for read performance. However, the graph shown in figure 30 is also a cause for great concern. There are a couple of awkward gaps to be seen, and especially the 2×6 configuration is problematic. The first few and last few measuring points of the 2×6 configuration are exactly on par with the 2×8 one, but in the entire mid-section all performance is gone.

In the above-mentioned tests we found that software striping outperformed any configuration on a single RAID controller. Therefore the intention is to configure a software RAID0 setup using lvm2 for performance improvement. We have chosen a hardware RAID5 setup of 5 SSDs and 1 hot spare; this gives us much better performance than RAID6 and still offers redundancy. In case one SSD fails, the RAID will rebuild itself using the hot spare in about 15 minutes. During this time window we are vulnerable to data loss, but that is a rather small and acceptable risk.

The performance measured for this setup sits somewhere between the performance shown in figures 24, 26 and 25, 27; we will spare you another graph. What about the sequential read and write performance? We measured this using a simple dd command with a big file (69GB), to rule out the mere 8GB RAM cache, and a small file (4.3GB) that only hits the cache. In practical situations, inserts and updates on the database will only hit the RAM cache, and reading from the database will be a mixture of reading from the 48GB of OS cache, the 8GB of onboard cache on the Areca and finally the SSDs.
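For those who want to reproduce the dd measurement, the pattern is simply timing a large sequential write and read; the file name and size below are scaled down for illustration (we used 69GB and 4.3GB files):

```shell
# Sequential write: conv=fdatasync makes dd flush the data to disk before
# reporting its throughput, so the figure is not just cache speed.
dd if=/dev/zero of=ddtest.bin bs=1M count=64 conv=fdatasync
# Sequential read: stream the file back; dd prints the throughput on stderr.
dd if=ddtest.bin of=/dev/null bs=1M
rm -f ddtest.bin
```

For the small-file (cache-only) case, simply read the same file twice and take the second run: the first read populates the OS page cache.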

SSDs perform great, especially for database servers, where lots of concurrent read and write operations are carried out. Tests show an overall performance improvement of ten times for our database server, but a general performance improvement figure cannot be given; it all depends on your file system usage.

When used wisely, SSDs are not much more expensive than traditional hard-disks (in the future probably cheaper, because you need fewer SSDs to outperform hard-disks) and they consume less energy.

However, since SSDs are a rather new technology, a lot of testing is required. It thus takes some time before one can actually go into production, and until we have solved, or at least understand, where the file system hiccups originate from, we will not go live with SSDs.

This benchmark was brought to you by Dennis Brouwer and Arjan Tijms of JDevelopment, an M4N team.

This entry was posted on Friday, February 20th, 2009 at 5:42 pm and is filed under Hardware, SSD.
