However, before looking at these options there are some fundamentals and issues to consider when it comes to storage.

The Problem with Figuring Out What Is Needed for an IO Workload

When it comes to analyzing an IO workload, there are some basic questions that must be answered:

How Much?

There are two main measurements used to answer the question of “how much?” with storage: how much data is moved in a given amount of time (MB/s) and how many logical operations are performed (IOPS, or input/output operations per second). When looking at the workload, the number of operations sitting in the queue is also considered, even for data that comes in at a steady rate.
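The two measurements are tied together by the size of each operation. Here is a minimal sketch of that relationship (the 8 KB and 64 KB sizes are just illustrative figures, with 8 KB happening to be SQL Server's page size; they are not measurements of our workload):

```python
# Rough relationship between IOPS and throughput for a fixed IO size.
# Real workloads mix IO sizes, so treat this as a back-of-the-envelope check.

def throughput_mb_per_s(iops: float, io_size_kb: float) -> float:
    """Convert an IOPS figure into MB/s for a given IO size."""
    return iops * io_size_kb / 1024.0

# 3,000 random 8 KB operations per second is only ~23 MB/s,
# while 3,000 sequential 64 KB operations is ~188 MB/s.
print(throughput_mb_per_s(3000, 8))   # ~23.4
print(throughput_mb_per_s(3000, 64))  # ~187.5
```

The point is that a device can look great on MB/s and still fall over on a small-block random workload, which is why both numbers matter.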

How Much of What?

The operations are going to be either reads or writes, and each will be either sequential or random. So you end up with some blend of the following four possibilities:

Sequential Reads

Sequential Writes

Random Reads

Random Writes

What is the Shape Over Time?

Sometimes the shape of IO will be a steady stream of data, but often disk operations get batched to make the IO more efficient. This means that the shape of the IO will usually be spiky at the micro level (say, over several seconds). There will also be a shape at the macro level, because most services have peak usage times and there are also scheduled IO-intensive operations such as backups.

How Fast?

Whatever the workload, operations need to be fulfilled within a certain amount of time. If the disk IO system is busy, operations have to wait in a queue before they are serviced.
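Queue depth, IOPS, and latency are tied together by Little's Law: on average, the number of outstanding operations equals the IOPS rate multiplied by the average latency. A minimal sketch (the 3,000 IOPS and 5 ms / 30 ms figures are made-up numbers, not ours):

```python
# Little's Law applied to storage:
#   average outstanding IOs = IOPS x average latency (in seconds)

def average_queue_depth(iops: float, latency_ms: float) -> float:
    """Average number of in-flight operations for a given IOPS rate and latency."""
    return iops * (latency_ms / 1000.0)

# A workload doing 3,000 IOPS at 5 ms per operation keeps about 15 IOs
# outstanding; if latency slips to 30 ms, the queue grows to about 90.
print(average_queue_depth(3000, 5))   # 15.0
print(average_queue_depth(3000, 30))  # 90.0
```

This is why "how fast" cannot be separated from "how much": the same IOPS number feels very different at 5 ms than it does at 30 ms.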

My main point with all of this is that it is not advisable to just take an average of the IOPS and megabytes per second over a day and go buy a storage solution. An average does not account for how quickly those IOPS are satisfied or for the shape of the data over time (the sketch after the list below shows how a daily average can hide exactly the spikes that matter). Even if these factors are taken into account, and taken into account correctly, the workload should still be tested on the actual equipment. The best an analysis can do is give a hypothesis. This leaves two possible courses of action:

Take an educated guess and buy something

Set up a demo unit or go to a demo location and load test the application

The main problem with option 2 is that it is a large time investment, and you have to be able to load test your application in a way that accurately reflects real-world usage.
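To illustrate the earlier point about averages, here is a minimal sketch with made-up per-second IOPS samples (not a measurement of our workload): a day of light activity plus twenty minutes of heavy, batched writes.

```python
# A made-up day of per-second IOPS samples: long quiet stretches plus
# short bursts, roughly what batched/flushed writes look like.
quiet = [200] * 86000    # most of the day at a light 200 IOPS
bursts = [9000] * 1200   # ~20 minutes of heavy flushing spread across the day
samples = quiet + bursts

mean_iops = sum(samples) / len(samples)
p99_iops = sorted(samples)[int(len(samples) * 0.99)]

# The daily average suggests a modest device would do, while the 99th
# percentile shows what the storage actually has to absorb during spikes.
print(round(mean_iops))  # ~321
print(p99_iops)          # 9000
```

Sizing to the ~321 IOPS average would leave every burst queued up behind a device that is an order of magnitude too slow for the moments users actually notice.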

My Personal Gripe with Storage Vendors

My gripe is that most large or fast storage solutions cannot be purchased at Micro Center. More seriously, even when you get a salesperson on the phone, they won’t say what all the options are and what they really cost. Sometimes there is an option to price them out on the website, but the real cost is whatever the salesperson will knock it down to. The salespeople want the data I was just talking about (often a limited subset that will only give them a rough ballpark). I have always hated this sales method; I want to see what I get for a certain cost, not tell them how much I can spend and have them tell me afterwards.

The other big thing is that vendors are not public with their actual performance numbers. The SPC-1 benchmarks are the best effort I have seen to provide useful information, but the number of devices in that repository is limited. To a degree this is understandable given all the different workloads I mentioned, but some basic numbers for various workloads under a RAID 10 configuration would be nice. In other words, “Give me some data, please.”

Our Particular Situation and Scaling Model

The sheer size of the stackoverflow.com database compared to our other sites is a major factor for us at the moment. Looking at the above image, the number of IOPS on the stackoverflow.com database is 30 times the number of IOPS on the superuser.com database. Because of this, treating stackoverflow.com’s database as a separate entity from the other databases does make sense.

We also don’t need that much capacity. Going off of Nick Craver’s growth analysis, the SO database will grow from 85 GB to 256 GB over the next 36 months (note: this is just a projection).
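For a rough sense of what that projection implies, here is a minimal sketch using only the two figures above (the monthly growth rate is derived from them, not something we have measured):

```python
# Implied compound monthly growth from the 36-month projection above.
start_gb, end_gb, months = 85, 256, 36

monthly_growth = (end_gb / start_gb) ** (1 / months) - 1
print(f"~{monthly_growth:.1%} per month")  # roughly 3.1% per month

# Capacity implied at the one- and two-year marks on the same curve.
for m in (12, 24):
    print(m, "months:", round(start_gb * (1 + monthly_growth) ** m), "GB")
# 12 months: ~123 GB, 24 months: ~177 GB
```

Even at the end of the projection the database fits comfortably on a single 640 GB FusionIO card or a modest SSD array, which is why raw capacity is not the deciding factor for us.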

I have mentioned this before, but our model has been to strike a balance between scaling up and scaling out. We are not particularly attracted to building giant monster systems, nor do we want a bunch of cheap little boxes. We want a reasonable number of medium-powered servers. In my mind this fits well with a Microsoft stack.

The Current Options

So, taking the above into account, here is my current thinking on what the following options might mean for us. We don’t have any demo units in hand yet, but our plan is to evaluate FusionIO as the PCI card option, EqualLogic as the SAN option, and Dell-approved SSDs put into our current Dell R710 database servers.

Option 1: FusionIO

Pros:

FusionIO is going to be the fastest option out there. To quote Brent Ozar in his review of FusionIO: “The only way to outperform a Fusion-IO drive is to invest six figures in a SAN and hire a really sharp SAN admin.”

Simplicity. There is a lot that goes into configuring a SAN correctly; with FusionIO we would copy our database file to the FusionIO drive and be done.

Cons:

Limited single-server availability. There doesn’t seem to be a simple RAID equivalent, so there will likely be only one of these cards in each server. Two can be put in a server and set up with software RAID, but I wonder if that might just end up lowering availability. In theory, since these are solid-state devices and not mechanical, I would expect better reliability than hard drives, but the technology is still fairly new.

Limited multi-system availability. Any sort of SQL clustering option is out the window, since clustering needs shared storage, and what you are left with is log shipping and synchronous or asynchronous mirroring.

I think the FusionIO option fits our scaling model well. We currently have two DB servers, a primary and a failover. We are planning on expanding to 4 servers so Stack Overflow (and maybe the rest of the original trilogy) can have its own primary and secondary server. There are different options, but one 640GB FusionIO card would cover growth for the trilogy and provide the fastest speed compared to a SAN or SSDs. We could then mirror asynchronously to the secondary server; in a failover situation Brent Ozar estimated about 90 seconds of data loss. Downtime might be around 30 minutes until we get the secondary server up and going manually. We generally favor speed over the highest possible uptime. It is not that we are glib about the uptime of our service, but we don’t have the uptime requirements of a financial institution. For our sites with higher uptime requirements, such as Careers, we can use the storage in the servers and possibly synchronous mirroring. I also imagine either this or SSDs in the servers themselves will be the cheapest solution: the 640GB drives were quoted at about 10k each for us and the 320GB at about 6.5k from a vendor.

Option 2: A 10GE EqualLogic SAN with Some SSDs

Pros:

Flexibility in growth and tiered storage. With a SAN we can add shelves as we grow and tier our storage effectively. For example, we could have an SSD array for the trilogy and a SAS array for our smaller sites, and move sites between tiers as they grow. We could also use the SAN’s storage for logs or tempdb.

Flexibility in availability options. Unlike with a FusionIO drive or SSDs in the server, clustering options are now open to us.

10GE might be useful for other things if we start to hit network bottlenecks. This is the main reason 10GE appeals to us more than Fibre Channel.

Cons:

The SAN as a logical unit is a single point of failure unless you buy two. I know these have lots of built-in redundancy but nonetheless our current thinking is that we would want two if we went the SAN route.

Cost. These things are not cheap. The EqualLogic PS6010S with 8 SSDs is priced at 46k on their site. The redundant 10GE switches, if we go with Dell, would be about 20k. So, without even factoring in various other total-cost-of-ownership items, if we want two SANs we are talking well over 100k. That would be the same cost as getting at least 8 more of our current database servers.

The flexibility and growth options that a SAN offers are appealing. The cost could drastically change if we decided we could live with one SAN, look at different vendors, or give up on the option of having SSDs in the SAN. The performance won’t be as high as the FusionIO would be but for our workload that extra performance might not really matter.

Option 3: SSDs in the Dell Servers

This option would be pretty similar to the FusionIO option, except that it trades some speed for more single-server availability options. With SSDs we can use a traditional RAID configuration in the servers. These drives on Dell’s site are 4.6k for a 150 GB 2.5-inch SAS 3 Gbps drive, so a mirror of two drives would be 9.2k and would give us 150 GB of capacity. That wouldn’t leave us much room for growth, so we would probably want 300 GB of capacity; in RAID 10 that will cost about 18.4k. At the moment I don’t have any data on how this would perform, but with the current cost of the Dell-approved SSDs for our servers this option isn’t too appealing to me yet.
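The arithmetic behind those figures, as a minimal sketch (the drive price and size come from the paragraph above; usable capacity is the standard RAID 1/RAID 10 halving):

```python
# Cost and usable capacity for the Dell-approved SSD option described above.
drive_price_k = 4.6   # thousands of dollars per 150 GB 2.5" SAS SSD
drive_size_gb = 150

def mirrored_usable_gb(drives: int) -> int:
    """RAID 1/10 mirrors every drive, so usable capacity is half the raw total."""
    return drives * drive_size_gb // 2

for drives in (2, 4):
    print(f"{drives} drives: {mirrored_usable_gb(drives)} GB usable, "
          f"{drives * drive_price_k:.1f}k")
# 2 drives: 150 GB usable, 9.2k  (the RAID 1 mirror)
# 4 drives: 300 GB usable, 18.4k (the RAID 10 set)
```

That works out to roughly 61 dollars per usable GB, versus about 16 dollars per GB for the 640GB FusionIO quote above, which is the crux of why this option is less appealing on cost.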

Conclusion

Right now we are still in the preliminary stages. Initially I am fond of the FusionIO option, whereas George is more in favor of a SAN. One of the main reasons I favor FusionIO at the moment is that it would satisfy our growth in the short term; interestingly enough, this is one of the reasons George is less fond of it. George’s main reason for a SAN is that we get greater flexibility with the features a SAN offers, such as the ability to fully cluster, snapshotting, full LUN replication, and dynamic expansion (on some models). By going the SAN route earlier rather than later, we don’t put off solving the problem until about a year from now, when it might be harder to change our infrastructure. If we were to get a SAN now, we would learn how to use the advantages its flexibility gives us.

I generally agree with this philosophy, but not in this particular case. The reason I favor putting off a SAN is that I feel in a year or so the SAN options that include SSDs might be a lot cheaper and more attractive. Also, the FusionIO option fits well with our current scaling model. Although the growth of our Stack Exchange sites looks very promising to me, I feel it is too early to predict how they will grow. This is not so much in terms of visitors but more in terms of IO workload growth; our developers could make changes that greatly affect our IO workload. So we might have a much better understanding of what we need 6 months to a year from now than we do currently.

What we really need is to get more data on the performance of these various options and get our hands on some demo units. I feel like all three of these paths are valid options. There are also valid options we haven’t looked at as closely yet (for example, SANs that don’t support SSDs). At this point it is clear that choosing a storage route is no small task.

“The EqualLogic PS6010S with 8 SSDs is priced at 46k on their site” – with Dell you need to ignore absolutely every piece of pricing on their site. We specced up a similar EqualLogic (but with 16x 15k disks) and then called up for their pricing – we got it to around 60% of the price on the website without pushing very hard.

Preliminary results show that the IOPS are far better than traditional hard disks, but still not that brilliant.

Anonymous

JJoos – I’d be interested in seeing your throughput results. I bet your bottleneck has become the RAID controller and SAS/SATA max. That’s where the Fusion-IO stuff comes in faster – it doesn’t pass through the SAS/SATA interface.

http://blog.stackoverflow.com George Beech

The Vertex 2s are MLC-type SSDs. Leaving out a lot of detail – MLC-type SSDs are less expensive, but they also sacrifice performance compared to SLC-type SSDs like the Intel X25-E series.

JJoos

And they have a lot more capacity.

http://twitter.com/MirceaChirea Mircea Chirea

Yes, but the controller will mitigate any performance losses; SandForce is much faster than Intel. Besides, the Vertex (1/2/3) EX also use SLC.

http://michaelgorsuch.com/ Michael Gorsuch

Kyle – this is an amazing post. I don’t have much to offer, but I am interested to see where you all go with this. Storage problems in general are not that interesting, but storage problems for lean companies that are growing like gangbusters… well, those are fascinating to me.

Since you are in a colo facility, this will be even more interesting since you all presumably have power restrictions to work around.

Best of luck.

http://www.google.com/profiles/marco.cantu Marco Cantu

Improving hardware speed is very interesting, but what about changing the database and/or the data model? Different relational DBs and NoSQL databases will give you radically different performance and scalability. I know, it’s also a significant investment in development!

Anonymous

Marco – typically changing a database back end or data model is a lot more expensive than throwing $20-$40k in hardware at the problem. Besides, when you change the database back end, you need a new set of hardware anyway, because you have to run the new database side-by-side with the old one during migration. You end up spending the same money anyway, so…

http://twitter.com/robbyt robbyt

Brent: throwing hardware at the problem works at year 0, but at year 3, not so much. Ask Google what they think about this problem. Ideally, throwing money at hardware should be the solution for the short term while a more effective back-end solution is being developed.

Anonymous

Robbyt – Right. When SO has to deal with Google scaling problems, I’ll get right on that. Thanks for the valuable tip. 😉

http://rjcox.co.uk/ Richard

From the table it appears the SO database has a single data file. Have you considered splitting it into two (this could start by simply having two data files in the filegroup, then moving to multiple filegroups to allow manually distributing different db objects into different data files) across separate arrays (and thus spindles)? Two arrays would allow additional concurrency of IO and should therefore reduce latency, with a much lower initial investment in hardware. It would also allow you to demonstrate that disk bandwidth/latency are the limiting factors here.

TL;DR: there are perhaps simpler options than you are considering here, especially if you consider more of the platform stack.

Anonymous

Richard – it’s pretty easy to split a filegroup into two files (assuming you rebuild all the indexes afterwards) but there’s not a significant IO benefit when you’ve got just one set of spindles. If/when SO moves to a SAN, that’ll be another story though, and that’s definitely something we’ll need to do.

Life isn’t as easy as just throwing in another array, though – we’re talking about rack-mount servers with a limited number of internal drive bays. Investing in an external RAID array isn’t a good call here – SO needs to move toward high availability, and external arrays aren’t good for clustering.

http://rjcox.co.uk/ Richard

Lack of space for more disks doesn’t make things easier. I suspect I’m recalling Jeff’s original server builds where there would be more space for more spindles (and that memory could itself be wrong :-)).

donflamenco

SO db lives on one set of spindles? What does that mean? Surely you are RAID? How many disks, what type?

You should be able to fit 6-8 2.5″ drives in any decent 1u-2u server.

In the HP DL180 G6 you can fit 25 2.5″ drives in 2U! RAID 10 with a flash-backed write cache will give you a lot of headroom.

http://blog.stackoverflow.com George Beech

“One set of spindles” means one LUN presented by the RAID card to the OS. We are running a 6 disk RAID10 for the database files and a two disk RAID1 for the OS.

http://azdeveloper.pip.verisignlabs.com/ Craig Larsen

Putting the log file or certain tables into separate physical files on a different spindle won’t help you if your SAS card is the bottleneck and all your drives connect to that one card. Divide your workload up over multiple SAS cards. This also frees the process that converts the log into actual database writes from being in contention for the same IO channel.

Imagine if you needed to move water between 3 above ground pools but only had one pump/hose to work with. You’d waste a good bit of time changing the single pump/hose to connect pool A/B and then B/C. If you had two pumps/hoses, you could run as fast as the amounts of water in each pool would allow.

(Not a perfect analogy, but hopefully you get what I am saying).

Chopper3

Kyle – what’s wrong with FC? >99% of all large/busy/highly-available databases in the world are stored on FC-based arrays. You have few servers, so you could easily use a small or built-in FC switch. Also, why go for enterprise SSDs in this array – wouldn’t a bunch of 15k disks be perfectly fast enough, much cheaper, and probably more resilient over time too? Something like an HP EVA4400 with the built-in 20-port switch, a single shelf filled with 600GB 15k’s in RAID 10, and a handful of FC HBAs would cost you half what the EqualLogic would but still give you what you want.

http://kyle-brandt.myopenid.com/ Kyle Brandt

Well, the main reason for 10GE is that the switches will have other uses instead of just being for the storage. As far as 10/15K SAS drives with one shelf goes, that might be fine; I don’t know. Our problematic workload right now is random writes. SSDs are known for their random read performance, but I am still unsure about the random write performance. I know the FusionIO would be amazing for this. As I said in my rant, I would love to get some actual data on these arrays. I would particularly like to know what random write IOPS they can achieve with a fast response time (say 10-30ms).

Chopper3

But how are you going to keep storage traffic ‘protected’ from regular traffic? QoS can be costly and complex to implement and manage – that’s why I’m such a proponent of FCoE – it has traffic management built in by default. My concern about using SSDs is that the FIRST thing to ‘go’ on SSDs is the random write performance – look at all the benchmarks, most SSDs will see up to a 75% drop in random writes over the first 3-6 months – that’s why I like disks – not as fast as SSDs on day one but their ‘drop as filled/used’ rate is almost nothing.

http://rjcox.co.uk/ Richard

The usual answer for traffic separation would be VLANs rather than QoS; they add a little per-packet overhead, but not much.

Chopper3

That provides no saturation management though – and storage can easily saturate.

http://twitter.com/robbyt robbyt

That’s true, you need to have a dedicated switch for storage. Some switches have an “iSCSI mode” which automatically identifies and prioritizes iSCSI packets, at the cost of other network traffic.

http://darrenkopp.myopenid.com/ Darren Kopp

We moved from an EqualLogic SAN with SSDs to FusionIO drives. We’ve been extremely happy with the FusionIO drives. We still use our EqualLogic SAN, but for backups and file shares; all the database is on the FusionIO drives.

My 2 cents: go with the FusionIO drives; the EqualLogic SAN route in our experience was a waste of money that we wish we had back (being a startup).

donflamenco

Talk about old school. Looking at your previous blog post, “Designing For Scalability of Management and Fault Tolerance”, I can’t see where you guys are doing caching on the DB. You must be doing some caching on the IIS tier. Can you talk about how your caching works? There are other options, maybe sharding out your Q&A.

SANs are great for consolidation and things like VMware, but they won’t increase performance for a single host, especially a “smart” DB server. Oracle is very smart about caching and SQL Server should be too.

http://kyle-brandt.myopenid.com/ Kyle Brandt

@donflamenco: Regarding the caching on the DB server, we have a very high SQL cache hit rate (over 99%). The problem that we have is with our writes (which are random). The writes have to be flushed out to disk, so more memory for caching won’t help us. If you notice the drop in the graph for read time, that is right after a DB RAM upgrade; the writes were not really affected by it.

donflamenco

You should read this: http://www.quora.com/Friendster/What-were-the-key-mistakes-that-Friendster-made?srid=To Especially this quote “Another big problem was that instead of sharding (or partioning) the databases properly, the VP Eng and VP Ops attempted to scale by installing an expensive Hitachi SAN and using daisy-changed MySQL slave replication, neither of which was a good idea. The result was tons of money spent, plus bad slave lag.”

Obviously sharding is going to take a bunch of work from your developers, but that should be the longer term goal. I would stay away from the SAN unless you need a ridiculous amount of drives/space (which you don’t), or you want to snapshot/replicate.

You say $10k for 640GB for FusionIO? Peanuts, you should do that.

It is interesting how write-heavy your DB is. Your DBAs should be able to profile heavy write queries. How large are the write caches on your PERCs? I’m assuming you have a battery-backed write cache. HP has a cool flash-backed write cache (with a supercap), which is nice for colos, since the batteries on battery-backed write caches only last 3-5 years.

http://twitter.com/cosjef Jeff Costa

This may not be “enterprise-y” enough, but have you considered an OCZ RevoDrive instead of the FusionIO drive? It might be a cheaper stopgap option until FusionIO prices fall into your range.

http://twitter.com/MirceaChirea Mircea Chirea

Have you considered mainstream SSDs such as OCZ Vertex 2 (Vertex 3 is coming out in Q3 with double the performance)? The SandForce controller is currently the fastest controller you can get for a low price and in order to push MLC into write exhaustion you’ll need to thrash the drive 24/7 for 6 months, with random incompressible data. Intel is also coming out with X25-M/E G3; the E will use eMLC, which is MLC but with a higher write tolerance; however at the current improvement rate, and with the great wear leveling of modern controllers, the drives will be obsolete before you get close to breaking a RAID 10 array (and the specific sectors will simply become read-only).

Alternatively you may want to look at the new Hitachi SSD; it’s a monster, incredibly fast and serious capacity, with a SAS 6Gb/s interface. However, it’s also expensive, but probably not more than the one Dell ships. The Vertex 2 (and soon 3) EX also use SLC NAND, but at that point it’s better to just get the Hitachi.

http://twitter.com/Capncavedan Dan Buettner

Feedback on a very basic installation of 80GB FusionIO cards in a pair of MySQL database servers: awesome. They sped up simple reads and writes noticeably (we didn’t spend a lot of time measuring that), and also resulted in certain long-running reports completing in somewhere between 50% and 10% of the original time.

Feedback on the EqualLogics: they are crap. A different system relies on a pair of these each in a RAID 6 configuration, and performance goes to lunch when a drive needs to be rebuilt. We also utilize internal replication across a WAN, and large updates have also resulted in sporadic performance problems.

JD Conley

We used FusionIO at Hive7 for handling very random, write-heavy I/O loads for social gaming. They were nothing short of amazing. We ran into some driver hiccups, but that was two years ago, and they were very early drivers. We used three systems with SQL 2005 Standard in a sharded environment for a multiplayer game that had about 20,000 simultaneous players at a given time served by 10 game servers. We didn’t have to do any fancy delayed persistence algorithms or other things you usually do for games. We just wrote the updated world data straight into the DB as commands rolled in, and the FusionIO cards smiled back at us.

I would definitely go with FC SAN storage. I do not understand why you prefer iSCSI; iSCSI will cause unnecessary problems.

Stay away from entry-level storage.
You can request test-drive units from vendors. You can ask vendors for a quote and get the cheapest.

Why is this not on Server Fault?

http://twitter.com/robbyt robbyt

You should check out Nexenta and connect via FC. If you buy from a good vendor and tell them your IOPS requirements, they should be able to build a scalable system for you that costs far less than a ‘six figure SAN’.

I currently work at a company that utilizes FusionIO in its production SQL environment. In regard to speed of I/O, it is as advertised, and we are happy with the performance the direct-attached storage gives us. In regard to HA and fault tolerance, though, if you have large databases (ours is 1.2 TB), the replication offered by DoubleTake or SteelEye will not be sufficient. We are currently using a mirror to give us redundancy in case of card failure, but even so any failover will not be automatic and downtime will be incurred. The lack of a proper RAID does keep me up at night, and we have had two card failures in two years. Just take steps so that when/if a card fails, you are satisfied with the recovery time for the business. It’s a title bout of speed of I/O vs. reliability of SAN HA; in the end, see which will let you get the most out of your environment but also a good night’s sleep.

http://twitter.com/robbyt robbyt

Or, just connect iSCSI over 10gbe…

If you want to do it on the cheap, set up a Solaris 11 server with a ton of disks and SSDs. We have a client that’s pushing about 900MBps of sustained non-cached reads through 10GbE to a ZFS 11-way stripe (total array size 18TB).

FusionIO is phenomenal on performance. If you can scale horizontally across servers (i.e. shard your DB by year, by entries marked as deleted by moderators (or old versions of edited content), by low query volume, or by whatever works for your database layout), then you will be on the right side of both the price and performance curves, as well as scaling without paying for the stupidly expensive “enterprise” solutions.

BTW, with modern RAID controllers you will probably find that RAID 5 actually performs better than RAID 10. In theory RAID 10 should be faster, but normally much more work is put into optimizing RAID 5 in the controllers’ firmware; however, YMMV.

I’d agree that if you go SAN it needs to be redundant as well, which definitely increases the price. But if it’s where you are putting all your eggs you want a warm spare carton (to mix metaphors).

unix_dude

We were in the same situation as you last year, but we already had a SAN. The problem was that the cost of adding 4 new bays to the SAN, replacing the Brocades we already had, buying new FC cards for the new servers we needed, etc., proved to be too expensive compared to other solutions out there, so we opted for the FusionIO route, which in the long run proved to be more cost effective for us and gave us the performance we needed. We made that decision after I tested most of the viable SSD/flash storage solutions available at the time. However, in your situation you may wish to check the following (see link below); I tested one and the performance was actually really great, and you get the benefits of fault tolerance like a standard SAN as well as future capacity growth. If you have the money, it would probably be a better solution than opting for the FusionIO cards.

There is also the Sun/Oracle F5100 array; the performance is excellent, although the cost is high, and availability may be a problem due to the Oracle takeover. I only ever tested this with Solaris running MySQL and again with Solaris running our SOLR applications, so I don’t know how it would perform with Windows and MSSQL.

The above two products would be my preferred choices if I had the budget to play around with and I needed performance that a SAN couldn’t provide at those prices.

http://www.kaminario.com Eyal Markovich

You have another option to consider: DRAM-based SSDs, which offer the best of both worlds – high performance and high availability. Check out Kaminario (www.kaminario.com) for example.

In SQL Server, high-latency writes to the database as reported by sys.dm_io_virtual_file_stats don’t automatically mean a problem, as these writes are asynchronous to the user session (check out lazy writes in SQL Server). However, in many cases these writes affect the performance of reads, and then you have a problem. I recently wrote a post on the importance of write performance at http://www.theiostorm.com/the-importance-of-write-performance-to-rdbms-applications/. The fact that some of the storage IO is done in the background is also important when you select the workload for the benchmark. You should focus on the wait events in SQL and not too much on the IO events. Understanding the waits, and mainly the I/O wait of your application, will help you understand what you need to improve in the storage and will help you customize the benchmark. Look in the http://www.theiostorm.com blog for posts on I/O wait and how to measure it in SQL.

A few words about the Kaminario solution (I am a Kaminario employee). The unique approach is a scalable DRAM-based solution that lets you grow both capacity and performance with your application’s demand. It is not in the server, so it offers a truly highly available solution that can be integrated with any clustering software. There are no controller bottlenecks, as these systems were built to deliver millions of IOPS and very high throughput while keeping microsecond latency.