Entries in ssd
(17)

I've been doing research onto the best use of flash SSD with the Oracle RDBMS over the past year and Quest has produced a whitepaper summarizing the findings. You can get it here.

Understanding the performance dynamics of SSD is critical in modern Oracle performance management. I'm organizing my work on SSD Oracle performance into an online reference that I hope to continually update as I learn more.

Finally, I'll be speaking on this topic at Oracle Open World in just a few weeks - here are the session details:

I was recently asked to provide advice on how best to accelerate Oracle database performance with flash SSD. In this post I’ll attempt to outline some thoughts on that.

There’s some thinking that SSD will soon replace spinning disk altogether - we just have to wait for the price to come down. However, I’m not so sure that this complete replacement is likely to occur in the immediate future. There are two competing forces at work:

The economies of “big data”, which drive us to reduce the cost per unit of storage ($$/TB): magnetic disk costs only about 5% of the cost of a high end SSD per terabyte.

The economies of high IO throughput, in which we are motivated to reduce the cost of providing a given IO rate. SSD (PCi flash) can generate IO rates at 1/20th of the cost of magnetic disks.

These two competing economies are not likely to change in the immediate future: magnetic disk can store large amounts of data much more cheaply than SSD; SSD can deliver very high IO rates more cheaply than magnetic disk. For most databases, it will not be cost effective to place the entire database on SSD – the best outcome will be obtained when we place parts of the database on SSD and some on magnetic disk.

SSD Performance basics

I’ve posted in the past on how SSD works with the Oracle flash cache, and I’ve presented on SSD performance at OOW. You might like to review those items. However, here’s a quick review of the basics:

When inserting data into an empty page, performance is slower, but not awful – around 250 microseconds

In order to update an existing page, a block erase is required – much, much slower – maybe 2000 microseconds

Enterprise SSD vendors - Fusion IO, Virident, etc - all have sophisticated algorithms to avoid the write penalty associated with SSD block erase operations. Amongst other techniques, they will maintain a pool of unused blocks. When a page needs to be updated it will be marked as invalid in the original location and moved to one of these blocks. Later on, garbage collection routines will clear up the invalid entries. The result is that updates don't always have to incur the high overhead of block erase operations.

Nevertheless, you want to avoid placing write intensive files on a flash disk, because as the disk fills up with modified blocks - and especially if the write rate exceeds the garbage collection capabilities - you might see the disk slow down dramatically.

To summarize:

Flash-based disks perform reads much faster than writes, and can suffer from performance degradation if subjected to intensive sustained write operations.

Options for SSD deployment on Oracle

Given the performance characteristics of SSD, how best to use SSD to boost Oracle performance? There are a few options:

Put the entire database on SSD

Place selected segments on SSD

Use the 11GR2 flash cache

Put temporary tablespace on SSD

Put the redo logs on SSD

Let’s look at each of these in turn:

Put the entire Database on SSD

This works great – if you can afford it. For most databases the cost of putting it all on SSD is too high. Placing data that is rarely accessed on SSD is not very cost effective, because you are paying a high cost per GB, but not getting any benefit from the relatively cheap IO rates.

Place selected segments on SSD

This is probably the best option if you don’t have enough SSD for the entire database, but it does require a bit of configuration. We identify the segments with the most IO, perhaps using a query on V$SEGMENT_STATISTICS:

We are looking for object that have a high read/write ratio, contribute to a significant number of reads, and are small enough to fit on our SSD. We create a tablespace on SSD, then move these objects to that tablespace.

If our most read intensive tables are massive, then they might not fit on the SSD. In this case we could consider partitioning them and placing the hottest partitions on SSD.

Use the 11GR2 flash cache

If you have 11GR2 and are on a compatible platform (Solaris or Oracle Enterprise Linux), then you can setup the Oracle DB flash cache. This is absolutely the easiest option – all you need to do is set a few parameters and bounce the database. The flash cache automatically caches the hottest blocks and usually provides an excellent performance improvement.

The improvement is not as significant as moving objects directly onto flash storage, since there still has to be magnetic disk IO getting things on and off. Also, the architecture tends to create a lot of writes to the SSD, which can challenge garbage collection. The chart below sums up a typical outcome – a huge boost from the flash cache, but not as big a boost as we could get by putting objects directly on SSD:

Put temporary tablespace on SSD

Temporary tablespace IO (from sorts and hash operations) can sometimes be the most significant form of database IO, and often the entire temporary tablespace will fit on the SSD. So should we consider putting temporary tablespace on the SSD?

I’m a bit reluctant to do this: by definition temporary tablespace IO is equal parts reads and writes, and since we are motivated to avoid write IO to the SSD, this seems like a dubious option. However, if your performance is absolutely dominated by temp tablespace IO and your SSD is tolerant of high write rates (eg has excellent garbage collection,etc) then it might be worth trying. After all, the SSD will definitely speed up the read side of temporary tablespace IO, even if the write IO does not get a significant boost.

Put redo logs on SSD

Although redo logs are very IO intensive, it’s almost all write IO and therefore this is probably not the best option. During sustained writes the Garbage Collection will probably break down and the disk might not end up performing that much better than a spinning disk. I'd not recommend this.

Conclusions

You can use SSD to boost database performance even if you don’t have enough for the entire database. Best options are:

Place segments with high read but low write rates directly on SSD

Use the SSD with the 11GR2 database flash cache (if you are on OEL or Solaris)

If temporary segment IO is absolutely your bottleneck – but only then – consider placing the temporary tablespace on SSD.

I don’t think using SSD for redo logs is a good idea. The high rates of sustained sequential write IO is not the ideal type of IO for SSD.

The word KEEP is a bit unfortunate, since it perhaps implies behaviour identical to the KEEP pool of the buffer cache. The KEEP cache is a separately managed area of specific size, so that you can control what proportion of your memory is allocated to “kept” blocks. The flash cache KEEP setting is more to prioritise caching – there seems to be no limit on how many blocks will be kept.

For the above three tables, if we read all the rows in each more or less simultaneously, we get the following outcomes (shown here in our lovely new Spotlight on Oracle flash cache drilldown):

Not surprisingly, where FLASH_CACHE is NONE, there is no flash cache storage. When flash cache is KEEP, all the blocks are kept in the flash cache, and consequently there is not enough room for all the blocks from tables where FLASH_CACHE=DEFAULT. However, unlike the KEEP buffer pool, there’s no way to limit the amount of blocks in the KEEP pool, and so you can quite easily fill up the cache with blocks form FLASH_CACHE=KEEP tables. Once that happens, those blocks will stay there forever.

If you have Spotlight (or at least the 7.5 version about to be released with Flash cache diagnostics), you can see this happening. Once the flash cache fills up, every new block introduced forces an old block out. This shows up as evictions in the following chart:

Oracle clearly evicts blocks from FLASH_CACHE=DEFAULT tables in preference to FLASH_CACHE=KEEP. Presumably (I haven't tested this) within each setting blocks are moved out using some form of Least Recently Used (LRU) or Not Recently Used algorithm.

KEEP should therefore be used only for relatively small tables. If you set FLASH_CACHE=KEEP for a large table (larger than the flash cache for instance) then you risk rendering the flash cache unusable for other tables. I hope Oracle chooses to implement a limit on KEEP size in a subsequent release. I think it would be far more useful if you could allocate multiple flash cache pools of a certain size as for KEEP and RECYCLE in the buffer cache.

Spotlight 7.5 – which includes flash cache diagnostics as well as ESX diagnostics, support for the 11gR2 parallel query queue and lots of other cool stuff - will be available next month, both on it’s own and as part of the TOAD DBA suite.

We’d normally regard this data loss as an unacceptable risk, so you would think that the best thing to do would be to turn the write cache off. This can be done with the following command:

hdparm -W 0 /dev/sdb

(assuming that /dev/sdb is the flash SSD).

Turning off the write cache – as you’d expect – reduces the write IO capacity of the device. Below we see the effect on two identical workloads:

These workloads involve SELECT and UPDATE operations on a table which is stored on a datafile directly on the SSD. There’s no db flash cache involved in this simulation.

The datafile write rate drops substantially and the work takes longer to complete, as we expect. But why does the read IO rate drop as well? The reason is because of free buffer waits.

As described in this post, when a IO subsystem has a higher read bandwidth than write bandwidth, then sessions may be able to add and update blocks in the buffer cache faster than the DBWR can clear them out. When this happens free buffer waits occur as sessions wait for buffers to be cleared.

We can see the free buffer waits in Spotlights event wait chart:

Disabling the write cache slows down disk performance somewhat, but it’s still a lot faster than a spinning disk. Furthermore, most workloads are not as update intensive as in my simulation so a lot of the time you won’t hit this problem. Nevertheless, it's important to realize that the X-26 has this write cache and that it may be artificially increasing write throughput at the cost of write safety.

One word of caution: I met a guy from Percona who told me that Intel doesn’t actually support the X-25 with the write cache disabled. This is a bit disturbing, since it implies that you can choose data safety or vendor support but not both!

Also, note that the write cache can be left enabled if the SSD is only being used for the 11GR2 database flash cache. In that configuration failed writes to the cache in the event of a disk failure will cause no harm: Oracle will detect that the cache has failed and will bypass the cache completely.

In this post I'm going to report on some performance comparisons I've recently conducted on using SSD flash storage for datafiles vs. using the new Oracle 11GR2 database flash cache.

It sometimes seems like I’ve been waiting for the the end of the spinning magnetic disk all my professional life. The technology is so ancient, so clearly limited and (argh) mechanical. So the fact that Solid State Disk (SSD) is becoming more an more practical as a database storage medium - and directly supported by Oracle in 11GR2 - is very exciting.

Using SSD as part of database storage can certainly yield big results, but it's important to understand the performance characteristics of flash SSD and make sure we don't use it inappropriately.

SSD comes in two flavours:

Flash based SSD uses the familiar flash technology that underlies the ubiquitous USB drives that have replaced floppy disks for small portable data storage. Flash RAM is cheap, provides permanent storage without battery backup and so has low power consumption.

DDR RAM based SSD uses memory modules that are not particularly different in nature from those that provide core memory for the server. This RAM is backed by non-volatile storage (disk or flash RAM) and internal batteries. In the event of a power failure, the batteries provide enough power to write the RAM memory to the non-volatile storage.

DDR SSD is too expensive (in terms of $$/GB) to be very useful to we database professionals today. But Flash based disk is becoming an increasingly viable alternative to magnetic disk.

Read, write and erase operations

Flash disk storage is organized into pages – typically 4K – and blocks – typically 256K. For read operations, the flash disk can return data very quickly from individual pages. Writing to a page is slower – maybe 10 times slower. However, it is only possible to write a page of data if there is an empty page in the block. If we need to write to a full block, it must first be erased. The Wikipedia entry on SSD gives the following figures for seek, write and erase times:

As a flash SSD fills with data, the percentage of simple block level writes that require an erase increases, and the write performance of the flash SSD degrades.

The TRIM API function

High end flash SSD support an API call TRIM, which allows the OS to proactively erase full blocks so that writes can complete with only a page-level IO. Most high end drives also support wear-leveling algorithms that move hot-pages around the device to avoid the risk of block level failures. Flash drives can only support a limited number of erase operations before a block becomes unreliable, but this is minimized providing the disk automatically moves hot pages around in physical storage.

MLC vs SLC

Cheap flash drives generally use Multi-Level-Cell (MLC) technology in which each cell can store two bits of data rather than SLC which can only store one bit. The effect of MLC is to increase density at the expense of performance, particularly write performance. MLC is also less reliable in terms of data loss. If you’re concerned about write performance – and you probably should be – then you should avoid MLC.

In general, if you want a high performance flash SSD - and why get one if it's not high performance - you should go for a SLC based flash SSD that supports the TRIM API and has good wear leveling capabilities. For my testing, I used an Intel X-25 E 32GB drive. It cost about $600 Australian (about $534 US dollars).

The problem with read/write speed disparities

Given that most databases do many more reads than writes, do we need to worry about the disparity between seek times and write times for flash SSD? The answer is definitely YES. For an Oracle database, a big mismatch between the ability to read from a device and the ability to write to the device can be particularly harmful when performing transactional activity through the buffer cache.

The issue relates to the caching of data within the buffer cache. If it’s very much easier to put blocks into the buffer cache than to write them out, then there’s a good chance that the buffer cache will fill up with dirty (unmodified) blocks and that free buffer waits will result. The figure below illustrates how free buffer waits can occur:

If you're using cheap flash drives, then the write speed is going to be much slower than the read speed and consequently you could end up with free buffer waits becoming your limited factor during high transactional activity.

Enter the Oracle Database Flash Cache

Oracle's database flash cache provides an alternative way of exploiting flash SSD. Instead of putting the entire datafile on flash, we use flash as a secondary cache. The flash cache can be massive and can speed up reads for frequently accessed blocks. But if the flash drive is busy, Oracle will simply decline to write to the cache. That way we can get some of the advantage of flash for read optimization without having to pay the write penalty.

I outlined the flash cache algorithms in this previous post, and here's the diagram I used there to outline the lifecycle of a block when the database flash cache is enabled:

The important part of this architecture is that the DBWR will only write the block to the flash cache if it is not overloaded. If the DBWR gets busy, then the flash cache write can be omitted and while this will reduce the efficiency of the cache, it prevents the buffer cache from filling up with modified blocks - so free buffer waits should not result.

Flash disk read performance

Let's see how this pans out in practice. Below we see the relative performance for 500,000 random indexed reads against:

A table on magnetic disk with no flash cache

A table on magnetic disk with flash caching

A table stored directly on flash disk

The magnetic disk was a Seagate 7200.7 80GB Barracuda drive, while the Solid State Disk drive was an Intel X-25 E 32GB drive which is a pretty high end SLC drive. In both cases, the tablespaces were created on raw partitions to avoid filesystem caching and the redo logs and other tablespaces were on a seperate drive. The database was restarted prior to each test.

Here's the read performance:

As you'd expect, storing the table directly on flash is fastest, since there are only read I/Os, and every read I/O is from the much faster flash SSD. Note that the flash cache gives us a lot of advantages as well - the number of database file IOs is reduced, and the reads from the flash cache are very fast indeed.

Update performance

Next I measured the performance of primary key based UPDATEs. The performance for such an UPDATE is a mix of read time - get the block into cache - and DBWR performance. If the DBWR can't write the modified blocks out fast enough then free buffer waits and/or write complete waits will occur.

What I expected was that the flash tablespace test would show a lot of free buffer waits, since in theory flash writes are so much slower than flash reads. I thought that using the flash cache would avoid that issue, and provide the best performance. However, the results surprised me:

I repeated the above tests about a dozen times with a variety of workloads, but the results were pretty similar in every case. While using the flash cache is better than no flash at all, it's still better to put the table directly on flash storage. The X-25 E was simply able to sustain a write rate much higher than I had expected - about 2,000 writes/sec. Intel claim - and now I believe - that they have sophisticated algorithms that avoid the write penalty usually associated with the flash SSD storage medium.

Note that it's actually the Flash Cache configuration that generates the most free buffer waits. The flash cache allows Oracle to achieve much higher logical read rates but - since the writes still go to the relatively slower spinning magnetic disk - the buffer cache often fills with dirty blocks.

It may be that over time the performance of the X-25 will reduce when all the blocks have been written to at least once Erase operations become more prevalent. However, if the TRIM API is working properly then this degradation should in theory be avoided. Note that TRIM support is not universal - not all SSDs and older versions of Windows may not support it.

Write complete waits: flash cache

In previous postings I noted that the DBWR will skip writing to the flash cache when busy, which should prevent flash cache activity being a direct cause of buffer busy waits. However, I noticed under some workloads that I would get a large number of "write complete waits: flash cache". For instance, the output below shows about 74% of elapsed time in this wait:

Write complete waits occur when a session wants to modify a block, but the DBWR is in the process of writing it out to the datafile. The flash cache equivalent occurs for the same reason, but instead of writing it to the datafile, the DBWR is writing it to the flash cache. This phenomenon can occur when individual blocks are being modified at very high rates; in this scenario, the chance that the block will be modified before the DBWR writes it out can be substantial.

Conclusion

I'm very impressed with the performance of the X-25 E flash SSD drive. If it can maintain it's high write throughput over time, then it offers a significant increase in throughput over traditional spinning disk or over the new 11GR2 database flash cache. However, a few words of warning:

I did not stress test the disk over a long period of time. Some flash drives exhibit more write latency as they wear in. In theory, the TRIM functionality of the Intel X-25 E should avoid that, but I did not verify it in these tests.

Flash drives can be an unreliable form of storage over long periods of time. Any disk can fail of course, but if you run a flash disk hot it will likely fail before a magnetic disk (though it will do more work during that period).

In my tests, my table was small enough to fit on the flash storage. Had it not, then using the database flash cache would have allowed me to get some benefits from the flash drive without having to buy enough storage to fit the whole database on the SSD.

I'm now even more enthusiastic about the use of flash drives. Providing you buy a high-performance drive (think SLC & TRIM) then you could consider placing entire tablespaces or database on these drives. If you can't afford top end flash drives for all your data, then the 11GR2 database flash cache can give you a pretty substantial performance boost.