More Memory Meanderings – IOPS and Form Factors July 19, 2010

I had a few comments when I posted on solid state memory last week and I also had a couple of interesting email discussions with people.

I seriously failed to make much of one of the key advantages of solid-state storage over disk storage, which is the far greater number of input/output operations per second (IOPS) it can support – a point picked up by Neil Chandler. Like many people, I have had discussions with the storage guys about why I think the storage is terribly slow and they think it is fast. They look at the total throughput from the storage to the server and tell me it is fine. It is not great, they say, but it is {let's say for this example} passing 440MB a second over to the server. That is respectable and I should stop complaining.

The problem is, they are just looking at throughput, which seems to be the main metric they are concerned about after acreage. This is probably not really their fault; it is the way the vendors approach things too. However, my database is concerned with creating, fetching, and altering records, and it does this as input/output operations. Let us say a disk can manage 80 IOPS (which allows an average of 12.5ms to both seek to the record and read the data – even many modern 7,200rpm discs struggle to average less than 12ms seek time). We have 130 disks in this example storage array and there is no overhead from any sort of RAID or any bottleneck in passing the data back to the server. {This is of course utterly unbelievable, but if I have been a little harsh in not stating that the discs can manage 8ms seek times, ignoring the RAID/HBA/network cost covers that}. Each disc is a "small" one of 500GB. They bought cheap disk to give us as many MB/£ as they could {10,000 and 15,000rpm disks will manage 120 and 160 IOPS respectively but cost more per MB}.

Four sessions on my theoretical database are doing full table scans at 1MB of data per IO {Oracle's usual maximum on 10.2}, each session receiving 100MB of data a second, so 400MB in total. Five discs {5 * 80 IOPS * 1MB} could supply that level of IO. It is a perfect database world and there are no blocks already in the cache for these scans to interrupt the multi-block reads.

However, my system is primarily an OLTP system and the other IO is records being read via index lookups and single block reads or writes.

Each IO reads the minimum unit for the database, which is a block. A block here is 4k. Oracle can't read part of a block.

Thus the 40MB of other data being transferred from (or to) the storage is single block reads of 4k – 10,000 of them a second. I will need 10,000/80 disks to support that level of IO. That is 125 discs, running flat out.

So I am using all 130 of my discs, and 96% of them are serving 40MB of requests while 4% are serving 400MB of requests. As you can see, as an OLTP database I do not care about acreage or throughput. I want IOPS. I need all those spindles to give me the IOPS I need.
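To make the arithmetic above concrete, here is a quick back-of-envelope sketch in Python (all figures are the illustrative ones from this example, not measurements):

```python
# Back-of-envelope check of the example above: 130 discs, 80 IOPS each.
DISK_IOPS = 80                              # ~12.5ms to seek and read, per disc

# Four full-table-scan sessions, 1MB multi-block reads, 100MB/s each.
scan_mb_per_sec = 4 * 100                   # 400MB/s in total
scan_discs = scan_mb_per_sec // DISK_IOPS   # 1MB per IO -> 5 discs suffice

# The OLTP workload: 40MB/s of single-block (4k) reads.
oltp_iops = 40_000 // 4                     # 40,000k / 4k = 10,000 IOPS
oltp_discs = oltp_iops // DISK_IOPS         # 125 discs running flat out

print(scan_discs, oltp_discs)               # 5 discs serve 400MB/s; 125 serve 40MB/s
```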

What does the 40MB of requests actually equate to? Let us say our indexes are small and efficient, with a height of 3 (b-level of 2): a root node, one level of branch nodes and then the leaf nodes. To get a row you need to read the root node, a branch node, a leaf node and then the table block. That is 4 IOs. So those 10,000 IOPS are allowing us to read or write 10,000/4 = 2,500 records a second.
You can read 2,500 records a second.
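The cost per record can be sketched the same way (again using the example's figures):

```python
# Physical reads per indexed row fetch: root, branch and leaf index
# blocks, then the table block itself.
IOS_PER_ROW = 3 + 1

oltp_iops = 10_000                    # the single-block IO budget from above
rows_per_sec = oltp_iops // IOS_PER_ROW
print(rows_per_sec)                   # 2,500 records a second
```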

Sounds a lot? Well, let us say you are pulling up customer records onto a screen and the main page pulls data from 3 main tables (customer, address, account_summary) and translates 6 fields via lookups. I'll be kind and say the lookups are tiny and Oracle reads each with a single record fetch. That is 9 record fetches for the customer screen, so if our 40MB of OLTP IO was all for looking up customers then you could show just under 280 customers a second, across all users of your database. If you want to pull up the first screen of the orders summary, each screen record is derived from 2 underlying main tables and again half a dozen lookups, but now with 10 records per summary page – that is 80 record fetches for the page. Looking at a customer and their order summary, you are down to under thirty a second across your whole organisation, doing nothing else.
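A rough sketch of the screen arithmetic (each "fetch" below is one of the 2,500 record reads a second worked out above; the table counts are just this example's):

```python
RECORDS_PER_SEC = 2_500                 # sustainable record reads from the array

# Customer screen: 3 main tables plus 6 lookup translations.
customer_fetches = 3 + 6
customer_screens = RECORDS_PER_SEC // customer_fetches       # just under 280/s

# Orders summary: 10 rows, each from 2 main tables plus 6 lookups.
orders_fetches = 10 * (2 + 6)                                # 80 fetches a page

# Customer screen plus their order summary together.
both_pages = RECORDS_PER_SEC // (customer_fetches + orders_fetches)
print(customer_screens, both_pages)     # ~277 customer screens, under 30 combined
```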

You get the idea. 2,500 record fetches a second is tiny. Especially as those 130 500GB disks give you 65TB of space to host your database on. Yes, it is potentially a big database.

The only way any of this works is the buffer cache. If you have a very healthy buffer cache hit ratio of 99% then you can see that your 2,500 records a second of physical IO coming in and out of the storage sub-system is actually supporting 250,000 logical-plus-physical record accesses a second. {And in reality, many sites buffer at the application layer too}.
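The hit-ratio multiplier, as a quick sketch:

```python
# A 99% buffer cache hit ratio means only 1 read in 100 goes to disc,
# so each physical record read stands in for 100 record accesses.
HIT_RATIO = 0.99
physical_records_per_sec = 2_500

total_accesses = round(physical_records_per_sec / (1 - HIT_RATIO))
print(total_accesses)                  # 250,000 logical-plus-physical accesses
```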

Using Solid State Storage would potentially give you a huge boost in performance for your OLTP system, even if the new technology was used to simply replicate disk storage.

I think you can tell that storage vendors are very aware of this issue, as seek time and IOPS are not metrics that tend to jump out of the literature for disk storage. In fact, often they are not mentioned at all. I have just been looking at some modern sales literature and white papers on storage from a couple of vendors and they do not even mention IOPS – but they happily quote acreage and maximum transfer rates. That is, until you get to information on Solid State Discs. Now, because the vendor can say good things about the situation, the information is there – one HP white paper I looked at quotes IOPS figures only once it reaches the solid state offerings.

More and more these days, as a DBA you do not need or want to state your storage requirements in terms of acreage or maximum throughput; you will get those for free, so long as you state your IOPS requirement. Just say "I need 5,000 IOPS" and let the storage expert find the cheapest, smallest disks they can to provide it. You will have TBs of space.

With solid-state storage you would not need to over-specify storage acreage to get the IOPS, and this is why I said last week that solid state storage does not need to match the capacity of current disks for it to take over. We would be back to the old situation where you buy so many cheap, small units to get the volume that the IOPS come almost as an accidental by-product. With 1GB discs you were always getting a bulk-buy discount :-)

I said that SSD would boost performance even if you used the technology to replicate the current disk storage. By this I mean you get a chunk of solid-state disk with a SATA or SAS interface in a 3.5 inch format block and plug it in where a physical disk was plugged in, still sending chunks of 4k or 8k over the network to the block buffer cache. But does Oracle want to stick with the current block paradigm for requesting information and holding data in the block buffer cache? After all, why pass over and hold in memory a whole block of data when all the user wanted was a specific record? It might be better to hold specific records. I suspect that Oracle will stick with the block-based structure for a while yet, as it is so established and key to the kernel, but I would not be at all surprised if something is being developed with Exadata in mind where data sets/records are buffered, and this could be used for data coming from solid state memory: a second cache which, when using Exadata or solid-state memory, holds single records. {I might come back to this in a later blog; this one is already getting bloated}.

This leads on to the physical side of solid-state discs. They currently conform to the 3.5” or 2.5” hard disc form factor but there is no need for them to do so. One friend commented that, with USB memory sticks, you could stick a female port on the back of a memory stick and a joint and you could just daisy-chain the USB sticks into each other, as a long snake. And then decorate your desk with them. Your storage could be looped around the ceiling as bunting. Being serious, though, with solid state storage you could have racks or rows of chips anywhere in the server box. In something like a laptop the storage could be an array 2mm high across the bottom of the chassis. For the server room you could have a 1U "server" and inside it a forest of chips mounted vertically, like row after row of teeth, with a simple fan at front and back to cool them (if needed at all). And, as I said last time, with the solid state being so much smaller and no need to keep to the old hard disk format, you could squeeze a hell of a lot of storage into a standard server box.

If you pulled the storage locally into your server you would be back to the world of localised storage, but LANs and WANs are so much faster now that if you had 10TB of storage local to your server you could probably share it with other machines on the network relatively easily, and yet have it available to the local server over as many, and as fat, a set of internal interfaces as you could get your provider to manage.

I'm going to, at long last, wrap up this instalment of my thoughts with a business one. I am convinced that solid-state storage is soon going to be so far superior a proposition to traditional disks that demand will explode. And so it won't get cheaper. I'm wondering if manufacturers will hit a point where they can sell as much as they can easily make and so hold the price high. After all, what was the argument for Compact Discs costing twice as much as old cassette tapes, even when they had been available for 5 years? What you can get away with charging.


I think you may be being a little harsh on the IOPS you get from a standard disk. You might get 80 from a 7,200rpm disk, but a 10K disk will give up to 120 IOPS, and a fast, noisy, expensive 15K disk will give between 150 and 180 IOPS (and a 2ms rotational latency as opposed to 4.2ms in a 7,200rpm disk – plus head movement of course). However, Enterprise-level uber-expensive SSDs will give 5,000 IOPS and no rotational latency. If you're monied enough to be able to fill an array with SSD, you might just need to ensure you're specifying acreage along with the IOPS requirement. Now wouldn't that be nice.

As for disk speeds, why have they not become faster than 15,000rpm? I seem to recall someone once telling me that it was something to do with the laws of physics, but I've been unable to google to check the veracity of that.

OK, maybe I was a little hard on the IOPS of modern disks, Neil – let's double it to 160 for the ultra-fast, ultra-noisy 15,000rpm disks. :-) They are so expensive that it makes the SSDs look a better bet cost-wise.

I've altered the post to say I am considering the cheaper disks that give more MB/£, as your point is valid but I can't face re-working my example for 120 or 160 IOPS. Let's just say that if you beat up your boss enough to pay for 15,000rpm disks you will get almost twice the IOPS.

I think the top speed of 15,000rpm is indeed a physics thing or, as Scotty on the Enterprise would say, "she'll nay take the strain, cap'n!". CDs, which admittedly are a lot more flimsy than hard disk platters, will sometimes shatter at 10,000rpm (48X), at which point the g-force at the edge of the disc is around 7,200g. When the CD goes it usually smashes the drive, especially if you are running faster than 48X. I could be talking rubbish though, as the platters in hard disks can be made of aluminium, which is going to take a heck of a lot more strain than plastic. I suspect heat is more of an issue, as the edge of the disk is going to be doing a couple of hundred miles an hour.

There is a lot of energy in fast spinning objects. I have seen the results of an ultra-centrifuge that lost an arm at speed; it punched a hole through the armoured housing and made a big dent in the concrete wall several metres away. I think if you had got in the way of that it would have ruined your morning. Maybe the reason that disk speed does not go beyond 15,000rpm is that, at that speed, if anything did go wrong (no matter how remote the chance) you would need something stronger than the hard disk housing to contain the problem. Perhaps disk vendors don't want to go maiming IT Operations guys; it's bad enough they have to hang around in noisy server rooms, crawling around under the floors and behind cabinets pumping out kilowatts of heat.

You touched briefly on local disk. Local disk seems to be coming back, not just using SSD locally but also in servers like Sun's X4540, which seems to be a bit of a favourite here. One of the disadvantages of standard disks, certainly for the X4540, is that the unit is so heavy that it takes 2 people to swap a disk. Just one more 'hidden' cost that people don't take into account when buying these servers.

As far as performance goes, we benchmarked an X4540 using local standard disks against a DL580 G5 with the Fusion-io SSD cards. The initial cost was roughly the same, performance was slightly better for the SSD cards, and they did not have the maintenance overheads that came with the X4540. And as you say – we didn't need the huge amount of storage (48TB I think) that came as a by-product of having enough disk to give the throughput we required on the X4540.

It is indeed another and more complex story. In retrospect I should not have included the word “write” there :-) However, the point I am making remains – that the IOPS of your storage system can well be far more significant than transfer rates, especially where your cache is in no way helping.

Hi Martin,
“… should not have included the word “write” there…”
That was my point, exactly. I just can’t refrain from nitpicking.
Of course your point about IO/s stands – it is well-built on a solid base. And you’re preaching to the choir, as far as I am concerned :-)
Cheers!
Flado

[…] Now I'd like to show that the use of IOTs has the potential to make your block buffer cache (BBC) far more efficient. Going to disc is very, very slow compared to going to memory {NB solid state storage improves this situation but does not remove it}. The block buffer cache has always been critical to Oracle SQL select performance as it allows you to access data in memory rather than on disc, and in general the more block buffer cache you have the faster your system will be. {I am of the opinion that the BBC is even more important now than ever. As hard discs get larger we are seeing fewer and fewer spindles per GB of storage and, in essence, disc storage is effectively getting slower – because more data is hosted on the same number of spindles and those spindles are not themselves getting faster – I digress, for more details see posts Big Discs are Bad and IOPs and Form Factors} […]