
Exadata storage software 11.2.2.4 introduced the Smart Flash Logging feature. The intent is to reduce overall redo log sync times - especially outliers - by allowing the Exadata flash storage to serve as a secondary destination for redo log writes. During a redo log sync, Oracle writes to disk and flash simultaneously and allows the redo log sync operation to complete as soon as the first device finishes.

I’ve reported in the past on using SSD for redo, including on Exadata, and generally I’ve found that SSD is a poor fit for the sequential write IO generated by redo logs. But this architecture should at least do no harm, and on the assumption that the SSD would at least occasionally complete faster than a spinning disk, I tried it out.

My approach used the same workload as in my earlier tests: 20 concurrent processes, each performing 200,000 updates and commits – a total of 4,000,000 redo log sync operations. I captured every redo log sync wait from 10046 traces and loaded them into R for analysis.
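Each worker was essentially a tight update/commit loop with tracing enabled. A sketch of one worker is below – the table and column names are hypothetical stand-ins, not my actual schema:

```sql
-- One of the 20 concurrent worker sessions (table TXN_TEST is hypothetical)
ALTER SESSION SET EVENTS '10046 trace name context forever, level 8';

BEGIN
  FOR i IN 1 .. 200000 LOOP
    UPDATE txn_test SET val = val + 1 WHERE id = MOD(i, 10000);
    COMMIT;   -- each commit forces a redo log sync (log file sync wait)
  END LOOP;
END;
/
```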

I turned flash logging on or off by using an ALTER IORMPLAN command like this (my DB is called SPOT):
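The command was along these lines – a sketch based on the 11.2.2.4 CELLCLI dbplan syntax; adapt the database names to your system:

```shell
# On each storage cell: disable smart flash logging for the SPOT database only
cellcli -e "ALTER IORMPLAN dbplan=((name=SPOT, flashLog=off), (name=other, flashLog=on))"

# To turn it back on for SPOT:
cellcli -e "ALTER IORMPLAN dbplan=((name=SPOT, flashLog=on), (name=other, flashLog=on))"
```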

So for this particular cell the flash disk “won” only about 3.5% of the time ((7,337,931-7,318,741)*100/(7,337,931-7,318,741+33,201,462-32,669,310)) and prevented no “outliers” – defined here as redo log syncs that would have taken longer than 500 ms to complete.
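The arithmetic can be checked directly from the statistic deltas quoted above:

```shell
# Deltas of the cell flash-log statistics quoted above
flash_first=$((7337931 - 7318741))    # writes where flash completed first
disk_first=$((33201462 - 32669310))   # writes where disk completed first
total=$((flash_first + disk_first))
pct=$(awk -v f="$flash_first" -v t="$total" 'BEGIN { printf "%.1f", f * 100 / t }')
echo "flash completed first in ${pct}% of log writes"
```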

Looking at my 4 million redo log sync times, I saw that the average and median times were statistically significantly higher when smart flash logging was involved:

Plotting the distribution of redo log sync times we can pretty easily see that there’s actually a small “hump” in times when flash logging is on (note logarithmic scale):

This is of course the exact opposite of what we expect, and I checked my data very carefully to make sure that I had not somehow switched samples. And I repeated the test many times and always saw the same pattern.

It may be that there is a slight overhead to running the race between disk and flash, and that this overhead pushes redo log sync times slightly higher. The overhead may become negligible on a busy system. But for now I personally can’t confirm that smart flash logging provides the intended optimization – in fact I observed a small but statistically significant degradation in redo log sync times when it is enabled.

In my previous post on this topic, I presented data showing that redo logs placed on an ASM diskgroup built from Exadata grid disks carved out of flash performed far worse than redo logs placed on an ASM diskgroup built from spinning SAS disks.

Of course, theory predicts that flash will not outperform spinning magnetic disk for the sequential write IOs experienced by redo logs, but on Exadata, flash disk performed much worse than seemed reasonable and worse than experience on regular Oracle with FusionIO SSD would predict (see this post).

Greg Rahn and Kevin Closson were both kind enough to help explain this phenomenon. In particular, they pointed out that the flash cards might be performing poorly because of the default 512 byte redo block size and that I should try a 4K blocksize. Unfortunately, at least on my patch level (11.2.2.3.2), there appears to be a problem with setting a 4K blocksize.
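For reference, the redo block size is requested when the log group is added. A sketch of the syntax (the group number and diskgroup name are placeholders for illustration):

```sql
-- Sketch: request a 4K redo block size when adding a log group
ALTER DATABASE ADD LOGFILE GROUP 5 ('+SSDDISK') SIZE 4G BLOCKSIZE 4096;
```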

As expected, redo log performance for SSD still slightly lags that of SAS spinning disks. It’s clear that you can’t expect a performance improvement by placing redo on SSD, but at least the 4K blocksize fix makes the response time comparable. Of course, with the price of SSD being what it is, and the far higher benefits provided for other workloads – especially random reads – it’s hard to see an economic rationale for SSD-based redo. But at least with a 4K blocksize it’s tolerable.

When our Exadata system is updated to the latest storage cell software, I’ll try comparing workloads with the Exadata smart flash logging feature.

In this Quest white paper and on my SSD blog here, I report on how using a FusionIO flash SSD compares with SAS disk for various configurations – datafile, flash cache, temp tablespace and redo log. Of all the options I found that using flash for redo was the least suitable, with virtually no performance benefit:

That being the case, I was surprised to see that Oracle had decided to place redo logs on flash disk within the Oracle Database Appliance, and also that the latest release of the Exadata storage cell software uses flash disk to cache redo log writes (Greg Rahn explains it here). I asked around at OOW hoping someone could explain the thinking behind this, but got very little insight.

I thought I’d better repeat my comparisons between spinning and solid state disk on our Exadata system here at Quest. Maybe the “super capacitor” backed 64M DRAM on each flash chip would provide enough buffering to improve performance. Or maybe I was just completely wrong in my previous tests (though I REALLY don’t think so :-/).

Our Exadata 1/4 rack has a 237GB disk group constructed on top of storage cell flash disk. I described how that is created in this post. I chose 96GB per storage cell in order to allow the software to isolate the grid disks created on flash to four 24GB FMODs (each cell has 16 FMODs). Our Exadata system has fast SAS spinning disks – 12 per storage cell for a total of 36 disks. Both the SAS and SSD disk groups had normal redundancy.

I ran an identical redo-intensive workload on the system using SAS or SSD diskgroups for the redo logs. Redo logs were 3 groups of 4GB per instance. I ran the workload on its own, and as 10 separate concurrent sessions.

The results shocked me:

When running at a single level of concurrency, the SSD based ASM redo seemed to be around 4 times slower than the default SAS-based ASM redo. Things got substantially worse as I upped the level of concurrency with SSD being almost 20 times slower. Wow.

I had expected the SAS based redo to win – the SAS ASM disk group has 36 spindles to write to, while the SSD group is (probably) only writing to 12 FMODs. And we know that we don’t expect flash to be as good as SAS disks for sequential writes. But still, the performance delta is remarkable.

Conclusion

I’m yet to see any evidence that putting redo logs on SSD is a good idea, and I keep observing data from my own tests indicating that it is neutral at best and A Very Bad Idea at worst. Is anybody seeing anything similar? Does anybody think there’s a valid scenario for flash-based redo?

The default – or at least a very common – configuration for Exadata is to configure all the flash as Exadata Smart Flash Cache (ESFC). This is a simple and generally performant configuration, but it won’t be the best choice for all cases. In particular, if you have a table which is performance critical and which could fit in the flash storage you have available, you might be better off configuring some of your flash as grid disk, creating an ASM disk group from it, and placing the table there.

Here’s the procedure:

1. Drop the flash cache, create a new, smaller flash cache, then create the grid disks from the unallocated space. These CELLCLI commands do that:
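On my system (384G of flash per cell, keeping 288G as flash cache) the commands looked something like this – treat the sizes and the ssddisk prefix as assumptions to adapt to your own configuration:

```shell
# Run on each storage cell: shrink the flash cache and carve grid disks
# from the freed flash (384G total - 288G cache = ~96G of grid disk)
cellcli -e "drop flashcache"
cellcli -e "create flashcache all size=288g"
cellcli -e "create griddisk all flashdisk prefix=ssddisk"
```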

There’s 384G of flash on each storage cell, so the above commands create about 96G of SSD grid disk. Run those commands on each cell node, perhaps by using the dcli command (see this post for an example).

2. The above procedure will create disks in the format o/cellIpAddress/ssddisk_FD_*_cellnode. Log into an ASM instance, and issue the following command to create a diskgroup from those disks:
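A sketch of that command follows – the diskgroup name and the compatible.* attribute values are my assumptions; adjust them to your environment:

```sql
-- Sketch: build an ASM diskgroup from the flash grid disks created above
CREATE DISKGROUP SSDDISK NORMAL REDUNDANCY
  DISK 'o/*/ssddisk*'
  ATTRIBUTE 'compatible.rdbms'         = '11.2.0.0.0',
            'compatible.asm'           = '11.2.0.0.0',
            'cell.smart_scan_capable'  = 'TRUE';
```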

Alternatively you can use the database control for the ASM instance to create the new diskgroup. Your new flash disks should show up as candidate disks.

The relative performance of flash disks, vs flash cache is similar in Exadata to what I’ve seen using the Database flash cache. Placing an object directly on flash is faster than using the cache, although the cache is very effective. Here’s the results for 200,000 primary key lookups across 1,000,000 possible primary keys:

I’ve been doing some performance benchmarks on our Exadata box, specifically focusing on the performance of the smart flash cache. I found that even after I switched the CELL_FLASH_CACHE storage setting to NONE, the flash cache still kept cached blocks in flash and would therefore give me artificially high values for the “cell flash cache read hits” statistic when I set CELL_FLASH_CACHE back to DEFAULT or KEEP. What I needed was a way to flush the Exadata flash cache.

Unfortunately there doesn’t seem to be a good way to flush the flash cache – no obvious CELLCLI command. Maybe I’ve missed something, but for now I’m dropping and recreating the flash cache before each run.

Luckily the dcli command lets me drop and recreate on each cell directly from the database node and even sets up passwordless connections. Here’s how to do it.

Firstly, create a script that will drop and recreate the flash cache for a single cell:
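Something like the following, saved as (say) flushcache.scl – the file name is my invention; a .scl file is a list of commands for cellcli to run against one cell:

```shell
# flushcache.scl -- cellcli commands to rebuild the flash cache on one cell
drop flashcache
create flashcache all
```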

The “-k” option copies the ssh key to the cell nodes, which means that after the first execution you’ll be able to do this without typing in the password for each cell node. The “--serial” option makes each command run one after another rather than all at once – you probably don’t need this…
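Putting it together, the dcli invocation is along these lines – cell_group is a file listing the cell hostnames, and the script name should match whatever you saved above (both names are my assumptions):

```shell
# Run the cellcli script on every cell in cell_group, one cell at a time
dcli -k --serial -g cell_group -l celladmin -x flushcache.scl
```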