Prior to storage server software version 11.2.3.2.0 (associated with Exadata X3), Exadata Smart Flash Cache was a “write-through” cache, meaning that write operations were applied both to the cache and to the underlying disk devices, but were not signalled as complete until the IO to the disk had completed.

Starting with 11.2.3.2.0 of the Exadata storage software[1], Exadata Smart Flash Cache may act as a write-back cache. This means that a write operation is made to the cache initially and de-staged to grid disks at a later time. This can be effective in improving the performance of an Exadata system that is subject to IO write bottlenecks on the Oracle datafiles.
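For reference, write-back caching is enabled at the storage cell level through CellCLI. The sequence below is a sketch from memory – on these early versions the existing flash cache must be dropped and cellsrv restarted, so check the documentation for your exact release before trying it:

    CellCLI> LIST CELL ATTRIBUTES flashCacheMode
    CellCLI> DROP FLASHCACHE
    CellCLI> ALTER CELL SHUTDOWN SERVICES CELLSRV
    CellCLI> ALTER CELL flashCacheMode = WriteBack
    CellCLI> ALTER CELL STARTUP SERVICES CELLSRV
    CellCLI> CREATE FLASHCACHE ALL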

Writes to datafiles generally happen as a background task in Oracle, and most of the time we don’t actually “wait” on these IOs. That being the case, what advantage can we expect if these writes are optimized? To understand the possible advantages of the write-back cache let’s review the nature of datafile write IO in Oracle and the symptoms that occur when write IO becomes the bottleneck.

When a block in the buffer cache is modified, it is the responsibility of the database writer (DBWR) to write these “dirty” blocks to disk. The DBWR does this continuously and uses asynchronous IO processing, so generally sessions do not have to wait for the IO to occur – the only time sessions wait directly on write IO is when a redo log sync occurs following a COMMIT.

However, should all the buffers in the buffer cache become dirty then a session may wait when it wants to bring a block into the cache – resulting in a “free buffer wait”.

Free buffer waits can occur in update-intensive workloads when the IO bandwidth of the Oracle sessions reading into the cache exceeds the IO bandwidth of the database writer. Because the database writer uses asynchronous parallelized write IO, and because all processes concerned are accessing the same files, free buffer waits usually happen when the IO subsystem can service reads faster than it can service writes.
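A quick way to check whether a system is exposed to this bottleneck is to look at the cumulative wait statistics. The query below is illustrative only, but uses nothing beyond the standard dynamic performance views:

    SELECT event, total_waits,
           ROUND(time_waited_micro / 1000000) AS seconds_waited
      FROM v$system_event
     WHERE event IN ('free buffer waits', 'buffer busy waits');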

Just such an imbalance between read and write latency exists in Exadata X2 – the Exadata Smart Flash Cache accelerates reads by a factor of perhaps 4-10 times, while offering no comparable advantage for writes. As a result, a very busy Exadata X2 system could become bottlenecked on free buffer waits. The Exadata Smart Flash Cache write-back cache accelerates datafile writes as well as reads and therefore reduces the chance of free buffer wait bottlenecks.

The figure below illustrates the effectiveness of the write-back cache for workloads that encounter free buffer waits. The workload used to generate this data was heavily write-intensive with very little read IO overhead (all the necessary read data was in cache). As a result, it experienced a very high degree of free buffer waits and some associated buffer busy waits. Enabling the write-back cache completely eliminated the free buffer waits by effectively accelerating the write IO bandwidth of the database writer. As a result, throughput increased fourfold.

However, don’t be misled into thinking that the write-back cache will be a silver bullet for any workload. Workloads that are experiencing free buffer waits are likely to see this sort of performance gain. Workloads where the dominant waits are for CPU, read IO, global cache co-ordination, log writes and so on will be unlikely to see any substantial benefit from the write-back cache.

[1] 11.2.3.2.1 is recommended as the minimum version for this feature as it contains fixes to significant issues discovered in the initial release.

It’s been tough to find time to do actual performance research of late, but I have managed to get a test system prepared that will allow me to determine if solid state disks offer some performance advantage over spinning disks when the redo entries are very large. This is to test the theory that the results I’ve published in the past (here and here for instance) actually apply only when the redo entries are relatively small. For small sequential writes to SSD, each successive write re-hits a just-written NAND page; since NAND pages cannot be rewritten in place, this forces an erase of a complete NAND block, whereas in a larger sequential write this will not occur since each write will hit fresh pages.

I’m still setting up the test environment to look at this, but first I thought it would be worth showing this pretty picture:

This chart shows how redo log sync time (i.e., the time taken to COMMIT) varies with the amount of redo information written since the last COMMIT. There is a slight overall upwards trend, but the really noticeable trend is the “sawtooth” effect, which I’ve highlighted below:

Can you guess what causes this?

I think it’s pretty clear that we are seeing the effect of redo buffer flushing. Remember, when you write redo entries, they are written to the redo buffer (or sometimes a strand). Oracle flushes the buffer when you commit, but also flushes it when it is one-third full, every 3 seconds (I think, from memory), or after 1MB of redo entries has accumulated. Given that, we can see what happens when we commit:

- If there has been no redo log flush since we started writing, we have to wait while LGWR writes all of our entries to disk.

- If a redo log flush occurs after we have written our entries but before we COMMIT, then we have to write virtually nothing (just a COMMIT marker, I suppose).

- Between the two scenarios, we may have to write some of our redo log entries. However, we should never have to write more than about 1MB.

In the chart above we can clearly see the redo log flushes occurring at 1MB intervals. If we write less than 1MB we generally have to write it all; above 1MB we only have to write a portion of the redo entries. Note that on this system I was pretty much the only session doing significant activity, so the pattern is very clear. On a busy system the effect would be randomized by other sessions causing flushes to occur.
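For what it’s worth, the redo volume and commit counts for a session can be sampled between commits from the standard statistics views. My actual measurements came from 10046 traces, so the query below is illustrative only:

    SELECT n.name, s.value
      FROM v$mystat s
      JOIN v$statname n ON n.statistic# = s.statistic#
     WHERE n.name IN ('redo size', 'redo synch time', 'user commits');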

Hopefully I’ll soon be able to compare HDD and SSD performance to see if there are any significant differences in these trends – the above data was generated by redo on SSD.

When Steven and I wrote MySQL Stored Procedure Programming our biggest reservation about the new stored procedure language was the lack of support for proper error handling. The lack of the SIGNAL and RESIGNAL statements prevented a programmer from raising an error that could be propagated properly through a call stack, and the lack of a general purpose exception handler which could examine error codes at run time led to awkward exception handling code at best, and poorly implemented error handling at worst.
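MySQL 5.6 addresses both complaints. A minimal sketch of the pattern, with the procedure, table and error codes invented for illustration:

    DELIMITER //
    CREATE PROCEDURE add_product(p_code VARCHAR(20), p_name VARCHAR(100))
    BEGIN
      DECLARE l_sqlstate CHAR(5);
      DECLARE l_errno    INT;
      DECLARE l_msg      TEXT;
      DECLARE EXIT HANDLER FOR SQLEXCEPTION
      BEGIN
        /* Fetch the SQLSTATE, MySQL error code and message into locals */
        GET DIAGNOSTICS CONDITION 1
          l_sqlstate = RETURNED_SQLSTATE,
          l_errno    = MYSQL_ERRNO,
          l_msg      = MESSAGE_TEXT;
        IF l_errno = 1062 THEN  /* anticipated: duplicate key */
          SIGNAL SQLSTATE '45000'
            SET MESSAGE_TEXT = 'Duplicate product code';
        ELSE
          RESIGNAL;  /* propagate unexpected errors unchanged */
        END IF;
      END;
      INSERT INTO products (product_code, product_name)
      VALUES (p_code, p_name);
    END//
    DELIMITER ;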

The exception handler catches any SQL exception, then uses the GET DIAGNOSTICS statement to fetch the SQLSTATE, MySQL error code and message into local variables. We then decide what to do with anticipated errors – duplicate or badly formed product codes – and SIGNAL a more meaningful application error. Unexpected errors are RESIGNALed unchanged.

This is a great step forward for MySQL stored procedures – the lack of a means to programmatically examine error codes previously made proper error handling difficult or impossible – and it represents a major gain in maturity.

Thanks to Ernst Bonat of www.evisualwww.com for helping me work through the usage of GET DIAGNOSTICS.

In my last post, I looked at the effect of the Exadata smart flash logging. Overall, there seemed to be a slight negative effect on median redo log sync times. This chart (slightly different from the last post because of different load and configuration of the system), shows how there’s a “hump” of redo log syncs that take slightly longer when the flash logging is enabled:

But of course, the flash logging feature was designed to improve performance not of the “average” redo log sync, but of the “outliers”.

In my tests, I had 40 concurrent processes writing redo as fast as they could. Occasionally this would result in some really long wait times. For instance, in this trace you see an outlier of 291,780 microseconds (the biggest outlier in my tests BTW) within an otherwise unremarkable set of waits:

To see if the flash logging feature was successful in removing these outliers, I extracted the top 10,000 waits from the roughly 8,000,000 waits I recorded in each category. Here’s a plot (non-logarithmic) of those waits:

So – the flash log feature was effective in eliminating or at least reducing very extreme outlying redo log sync times. Most redo log sync operations will experience no improvement or maybe even a slight degradation. But for the small number of log syncs that would have experienced a really excessive delay, the feature works as advertised – it reduces the chance of really excessive log file syncs.

In my opinion, this effect doesn’t imply that the flash can process a redo log write faster than the magnetic disks – in fact, probably the opposite is true. But given two destinations to choose from, we avoid the really long delays that occur when just one of the destinations is overloaded.

Exadata storage software 11.2.2.4 introduced the Smart Flash Logging feature. The intent of this is to reduce overall redo log sync times – especially outliers – by allowing the Exadata flash storage to serve as a secondary destination for redo log writes. During a redo log sync, Oracle will write to the disk and flash simultaneously and allow the redo log sync operation to complete when the first device completes.
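The flash log itself is a small area carved out of each cell’s flash storage. From memory, it is created automatically on upgrade, and can be created and inspected from CellCLI roughly like this:

    CellCLI> CREATE FLASHLOG ALL
    CellCLI> LIST FLASHLOG DETAIL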

I’ve reported in the past on using SSD for redo – including on Exadata – and generally I’ve found that SSD is a poor fit for redo log style sequential write IO. But this architecture should at least do no harm, and on the assumption that the SSD will at least occasionally complete faster than a spinning disk I tried it out.

My approach involved the same workload I’ve used in similar tests. I ran 20 concurrent processes, each of which performed 200,000 updates and commits – a total of 4,000,000 redo log sync operations. I captured every redo log sync wait from 10046 traces and loaded them into R for analysis.
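Each worker was a simple update/commit loop running with extended SQL trace enabled (level 8 includes wait event timings). A sketch of the kind of worker involved – the table and column names are invented:

    ALTER SESSION SET EVENTS '10046 trace name context forever, level 8';

    BEGIN
      FOR i IN 1 .. 200000 LOOP
        UPDATE txn_test
           SET txn_count = txn_count + 1
         WHERE txn_id = MOD(i, 1000);
        COMMIT;  -- each COMMIT issues a redo log sync
      END LOOP;
    END;
    /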

I turned flash logging on or off by using an ALTER IORMPLAN command like this (my DB is called SPOT):
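What follows is a reconstruction from memory – the exact flashLog directive syntax may differ between storage software versions:

    CellCLI> ALTER IORMPLAN dbplan=((name=SPOT, flashLog=off), (name=other, flashLog=on))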

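The per-cell flash log counters come from the cell metrics. From memory, the relevant metrics (FL_FLASH_FIRST and FL_DISK_FIRST count which destination completed first; FL_PREVENTED_OUTLIERS counts avoided long syncs) can be listed like this:

    CellCLI> LIST METRICCURRENT WHERE name LIKE 'FL_.*' ATTRIBUTES name, metricValue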
So for this particular cell the flash disk “won” only about 3.5% of the time – (7,337,931-7,318,741)*100/(7,337,931-7,318,741+33,201,462-32,669,310) – and prevented no “outliers”. Outliers are defined as redo log syncs that would have taken longer than 500 ms to complete.

Looking at my 4 million redo log sync times, I saw that the average and median times were statistically significantly higher when the smart flash logging was involved:

Plotting the distribution of redo log sync times, we can pretty easily see that there’s actually a small “hump” in times when flash logging is on (note the logarithmic scale):

This is of course the exact opposite of what we expect, and I checked my data very carefully to make sure that I had not somehow switched samples. And I repeated the test many times and always saw the same pattern.

It may be that there is a slight overhead to running the race between disk and flash, and that this overhead makes redo log sync times slightly higher; perhaps it becomes negligible on a busy system. But for now I personally can’t confirm that smart flash logging provides the intended optimization – in fact, I observed a small but statistically significant degradation in redo log sync times when it is enabled.