Posted
by
timothy
on Wednesday June 04, 2008 @01:22PM
from the add-it-to-everything-please dept.

BobB-nw writes "Sun will release a 32GB flash storage drive this year and make flash storage an option for nearly every server the vendor produces, Sun officials are announcing Wednesday. Like EMC, Sun is predicting big things for flash. While flash storage is far more expensive than disk on a per-gigabyte basis, Sun argues that flash is cheaper for high-performance applications that rely on fast I/O Operations Per Second speeds."

That definitely makes sense. Making EVERYTHING SSD would be very expensive, and it doesn't really make sense anyway: except at the local level, it wouldn't make a huge difference, due to network bandwidth limits.

I was thinking about this at Fry's the other day while trying to decide whether I could trust a replacement Seagate laptop drive similar to the one that crashed on me Sunday, and I concluded that the place I most want to see flash deployed is in laptops. Eventually, HDDs should be replaced with SSDs for obvious reliability reasons, particularly in laptops. However, in the short term, even just a few gigs of flash could dramatically improve hard drive reliability and battery life for a fairly negligible increase in the per-unit cost of the machines.

Basically, my idea is a lot like the Robson cache idea, but with a less absurd caching policy. Instead of uselessly making tasks like booting faster (I basically only boot after an OS update, and a stale boot cache won't help that any), the cache policy should be to try to make the hard drive spin less frequently and to provide protection of the most important data from drive failures. This means three things:

A handful of frequently used applications should be cached. The user should be able to choose apps to be cached, and any changes to the app should automatically write through the cache to the disk so that the apps are always identical in cache and on disk.

The most important user data should be stored there. The user should have control over which files get automatically backed up whenever they are modified. Basically a Time Machine Lite so you can have access to several previous versions of selected critical files even while on the go. The OS could also provide an emergency boot tool on the install CD to copy files out of the cache to another disk in case of a hard drive crash.

The remainder of the disk space should be used for a sparse disk image as a write cache for the hard drive, with automatic hot files caching and (to the maximum extent practical) caching of any catalog tree data that gets kicked out of the kernel's in-memory cache.

That last part is the best part. As data gets written to the hard drive, if the disk is not already spinning, the data would be written to the flash. The drive would spin up and get flushed to disk on shutdown to ensure that if you yank the drive out and put it into another machine, you don't get stale data. It would also be flushed whenever the disk has to spin up for some other activity (e.g. reading a block that isn't in the cache). The cache should also probably be flushed periodically (say once an hour) to minimize data loss in the event of a motherboard failure. If the computer crashes, the data would be flushed on the next boot. (Of course this means that unless the computer had boot-firmware-level support for reading data through such a cache, the OS would presumably need to flush the cache and disable write caching while updating or reinstalling the OS to avoid the risk of an unbootable system and/or data loss.)
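A rough sketch of that flush policy in Python (all the names here are hypothetical; a real version would be integrated with the kernel's buffer cache and storage driver, not user code):

```python
import time

class FlashWriteCache:
    """Toy model of the flush policy described above: absorb writes in
    flash while the disk is spun down, and flush on shutdown, on any
    spin-up, periodically, and after a crash."""

    FLUSH_INTERVAL = 3600  # flush at least hourly to bound data loss

    def __init__(self):
        self.pending = []          # blocks written to flash, not yet on disk
        self.disk_spinning = False
        self.last_flush = time.monotonic()

    def write(self, block):
        if self.disk_spinning:
            self._write_to_disk(block)   # disk already up: write through
        else:
            self.pending.append(block)   # disk down: absorb in flash
            if time.monotonic() - self.last_flush > self.FLUSH_INTERVAL:
                self.flush()             # periodic flush

    def read_miss(self):
        # any read that forces a spin-up is a free opportunity to flush
        self.disk_spinning = True
        self.flush()

    def shutdown(self):
        # flush on shutdown so the disk is never stale if moved elsewhere
        self.disk_spinning = True
        self.flush()

    def flush(self):
        for block in self.pending:
            self._write_to_disk(block)
        self.pending.clear()
        self.last_flush = time.monotonic()

    def _write_to_disk(self, block):
        pass  # stand-in for the real (slow, motor-spinning) disk write
```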

As a result of such a design, the hard drive would rarely spin up except for reads, and any data frequently read would presumably come out of the in-kernel disk cache, so basically the hard drive should stay spun down until the user explicitly opened a file or launched a new application. This would eliminate the nearly constant spin-ups of the system drive resulting from relatively unimportant activity like registry/preference file writes, log data writes, etc. By being non-volatile, it would do so in a safe way.

This is similar to what some vendors already do, I know, but integrating it with the OS's buffer cache to make the caching more intelligent and giving the user the ability to request backups of certain data seem like useful enhancements.

Thoughts? Besides wondering what kind of person thinks through this while staring at a wall of hard drives at Fry's? :-)

I disagree that these disks should be used as a write cache. Frequent, incremental modifications to files are exactly what you DON'T want to use flash/SSD for, since they will wear out larger flash "blocks" faster than regular hard-disk writing would. If you're not going to take advantage of HDD technology's superior write lifetime, you might as well not have one at all.

Five years ago, I would have agreed. These days, some of the better flash parts are rated as high as a million write cycles. If we're talking about 4 GB of flash, a million write cycles on every block would take a decade of continuous writes at 10 megabytes per second. Real-world workflows obviously won't hit the cache nearly that hard unless your OS has a completely worthless RAM-based write caching algorithm.... Odds are, the computer will wear out and be replaced long before the flash fails. That said, in the event of a flash write failure, you can always spin up the drive and do things the old-fashioned way. And, of course, assuming you put this on a card inside the machine, if it does fail, you wouldn't have to replace the whole motherboard to fix the problem.
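The endurance arithmetic in that paragraph is easy to check. A quick back-of-envelope sketch, assuming perfect wear leveling (writes spread evenly across every block):

```python
# Back-of-envelope endurance check for the figures quoted above.
cache_bytes = 4 * 10**9          # 4 GB of flash
write_cycles = 1_000_000         # rated erase/write cycles per block
write_rate = 10 * 10**6          # 10 MB/s of continuous writes

total_writable = cache_bytes * write_cycles   # bytes before wear-out
seconds = total_writable / write_rate
years = seconds / (365 * 24 * 3600)
print(f"{years:.1f} years of continuous writes")  # roughly 12.7
```

Real workloads won't come close to a sustained 10 MB/s into a 4 GB cache, so the "decade of continuous writes" claim checks out.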

That said, to reduce thrashing of the write cache, it might be a good idea to add a cap of a meg or two and spin up the hard drive asynchronously once the write cache size exceeds that limit. Continue writing to the flash to avoid causing user delays while the HD spins up (huge perceived user performance win there, too) and flush once the drive is up to speed.
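A minimal sketch of that cap-and-asynchronous-spin-up idea (class and names are purely illustrative; a real implementation would live in the storage driver):

```python
import threading

SPIN_UP_THRESHOLD = 2 * 1024 * 1024   # the "meg or two" cap suggested above

class ThresholdCache:
    """Keep absorbing writes in flash, but kick off an asynchronous
    spin-up once the cached data crosses the threshold, so the user
    never waits on the motor."""

    def __init__(self):
        self.cached_bytes = 0
        self.spinning_up = False
        self.lock = threading.Lock()

    def write(self, data: bytes):
        with self.lock:
            self.cached_bytes += len(data)    # flash absorbs the write now
            if self.cached_bytes > SPIN_UP_THRESHOLD and not self.spinning_up:
                self.spinning_up = True
                # spin up in the background; writes keep landing in flash
                threading.Thread(target=self._spin_up_and_flush).start()

    def _spin_up_and_flush(self):
        # real code would block until the platters reach speed, then flush
        with self.lock:
            self.cached_bytes = 0
            self.spinning_up = False
```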

You could also do smart caching of ephemeral data (e.g. anything in /tmp, /var/tmp, etc.). Instead of flushing changes to those files to disk on close, wait to flush them until there's no room for them in the RAM buffer cache, and then flush them to the flash. After all, those directories get wiped on reboot anyway, so if the computer crashes, there's no advantage to having flushed anything in those directories to disk....

BTW, in the last week, I've lost two hard drives, both less than a year old. I'm not too impressed with the write lifetimes of Winchester disk mechanisms. :-)

Why not just add more system RAM for the kernel to use as cache? I realize flash is a bit cheaper than DRAM now, but it still seems like a very roundabout way to improve performance when the existing OS architecture can do the same job and is well documented.

Adding another layer of caching via flash just means we'll have more things to go wrong.

Because write caches in RAM go away when your computer crashes, the power fails, etc. Battery-backed RAM is an option, but is a lot harder to get right than a USB flash part connected to an internal USB connector on a motherboard.... In-memory write caching (without battery backup) for more than a handful of seconds (to avoid writing files that are created and immediately deleted) is a very, very bad idea. There's a reason that no OS keeps data in a write cache for more than about 30 seconds (and even that is about five times too long, IMHO).

Write caching is the only way you can avoid constantly spinning up the disk. We already have lots of read caching, so no amount of improvement to read caching is likely to improve things that dramatically over what we have already.

Even for read caching, however, there are advantages to having hot block caches that are persistent across reboots, power failures, crashes, etc. (provided that your filesystem format provides a last modified date at the volume level so you can dispose of any read caches if someone pulls the drive, modifies it with a different computer, and puts the drive back). Think of it as basically prewarming the in-memory cache, but without the performance impact....

You're either incredibly lucky, haven't owned many hard drives, or are lying. I'll presume you're honest (despite your handle), and the other two are equally likely in my opinion. Personally, I've had at least 15 hard drives crap out on me over the years, not counting those of friends, family and coworkers, which sends the number well into triple digits. And those are just the ones I've seen first hand. Add in the ones I know about at companies I've worked in and the number is in the thousands easily. Hard drive

I also admit that I use RAID 5 on my home server (my main data store) but not on my other computers.

Actually I use RAID 0 on one of my servers. Still lost two hard drives on it.

How the hell do you get power surges in a computer?

Power supplies do not filter out all surges, and by design they can't help with power dips. Flip your box on/off 20 times in under 5 seconds and you'll likely have some dead equipment. Have a lightning strike in your vicinity and you'll likely have some fried equipment regardless of your power supply make. Recently I had a loose neutral wire on the main in my house which made voltages swing by + or - 40 volts. Not good for t

Erm. You have a UPS on your computer and it still has dodgy power? You have something seriously wrong in that case.

No, that's why I don't have hard drives dying from surges anymore. I learned my lesson long ago after I fried one too many bits of electronics. I've lived in too many areas with dodgy power to trust what comes out of the wall anymore. Now my main problems are static and shock from the odd bit of dropped equipment.

Though I did once have a UPS and the attached computers fry due to a nearby lightning strike. Get 1.21 gigawatts across a UPS and it doesn't matter what you've got protecting it. :-)

Forgot to mention that ethernet cables and phone cables are susceptible to power surges too - more so than the main power cable, in my experience. I've seen more than a few fried ethernet cards and motherboards from an insufficiently protected DSL modem.

the cache policy should be to try to make the hard drive spin less frequently

Actually, this is exactly what you don't want to do. What you really want to do with an HDD is leave it spinning for as long as possible. Spin-up is when you get most of the mechanical wear, thus shortening the life of the drive. As an added bonus it uses a lot of power, too.

Yes and no. You're right that spinning up and down causes more mechanical wear on the spindle motor. However, leaving drives in laptops running continuously is also bad. Hard drives don't like heat, and laptop enclosures are not designed to dissipate heat from the drive. They basically have zero airflow across the drive, so the top of the drive enclosure and the case develop this layer of heated air that further insulates the drive from dissipating heat.

Good points. Most of my drive knowledge relates to big RAID installations, so I wasn't thinking about laptops when I responded. My own preference would be to replace the HDD in the laptop entirely, and that's getting more and more reasonable. Yes, IDE flash drives are pretty expensive, but you could get a 32GB CF card and a CF-IDE adapter for around $150 last time I checked. Supposedly the tech that allows for 32GB CF also makes 64GB possible, which is the sweet spot for me on a laptop, but I don't seem to be

Last I checked, flash write performance (at least for CF and USB stuff) still left something to be desired. A laptop hard drive still exceeds the speed of flash when writing by a factor of about 2-3 or thereabouts. The read performance is almost even for 5400 RPM laptop drives, and within a factor of 1.5 of even 7200 RPM drives, so that's more livable. You probably wouldn't want to capture audio or video to a flash drive or do other tasks that involve continuous high speed writes, at least for now.

A single ioFusion [tgdaily.com] card has the concurrent data serving ability of a 1U server cabinet full of media servers. They do this by having 160 channels on a drive controller that also incorporates flash memory. Since each channel is a few orders of magnitude faster than a mechanical hard drive, one card can handle a flurry of concurrent random access requests as fast as 1000 conventional hard drives.

The perfect thing for serving media, where you don't need a few GB per customer, you need the same few GB served o

I was sure that figure was upwards of a million cycles per sector in modern flash chips.

Also, throw in wear leveling and spare sectors. A million writes to a filesystem sector doesn't mean a million writes to a particular physical sector (it could be 1000 writes each to 1000 different sectors), and when a sector does wear out, it simply gets taken out of service and replaced with a spare one. The same principle is used in mechanical hard drives: if a sector is problematic to read from or write to, it gets marked as bad and the filesystem sector is remapped somewhere else.
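The "1000 writes each to 1000 different sectors" point can be illustrated with a toy ideal wear leveler that rotates writes round-robin across the device:

```python
# A million logical writes aimed at ONE filesystem sector, spread
# round-robin over the whole device by an idealized wear leveler.
physical_sectors = 1000
logical_writes = 1_000_000

wear = [0] * physical_sectors
for i in range(logical_writes):
    wear[i % physical_sectors] += 1   # ideal round-robin leveling

print(max(wear))   # each physical sector absorbs only 1000 erase cycles
```

Real controllers aren't this perfect, but the orders of magnitude are the point.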

SSDs could quite likely last longer than mechanical hard drives in this regard.

No marketing or sales executive will ever countenance the adding of "spare" sectors to a disk. If there are 100 billion physical sectors, then by God it's going to say so on the sales info.

In order to do wear leveling you have to have additional metadata, which will take up additional space on disk. With SSD you pay a premium per byte over a magnetic disk. Do you really want a file system that's not going to make the best use of the space you just bought for an arm and a leg?

As capacity goes up, the feature size on flash gets smaller. This means less energy per bit and a thinner dielectric. So, as the density of flash goes up, write cycle lifetime potentially goes down.

HDDs have the same issue of bits being less "durable" as capacity goes up. However, the media never wears out for HDD. Furthermore, it is already accepted that there will be many bit errors and these are simply corrected with error correction codes and mapping out bad sectors.

I was sure that figure was upwards of a million cycles per sector in modern flash chips.

I keep seeing this figure of a million cycles per sector but have yet to see it on a datasheet.

To paraphrase the above response:

"No marketing or sales executive will ever countenance the downward estimation of erase/write cycles per sector. If the chip can survive one million cycles per sector, then by God it's going to say so on the sales info."

Last time I checked no one was listing the lifetime of their chip at more than 100k cycles per sector.

What manufacturer is quoting anything near one million erase/write cycles per sector?

Why would you bother putting the programs and operating system on SSD for a server? Once the files are loaded into memory, you'll never need to access them again. SSD only helps with OS and Programs when you are booting up, or opening new programs. This almost never happens on most servers.

Random access time is (evidently) much better on flash. That's how Vista ReadyBoost works - there's a performance boost (a tiny one) if you let it put the non-sequential parts of the swap file onto a flash key.

I imagine that you could increase performance for some types of databases by running on a solid-state drive.

It sounds like the SSDs are internal drives for the server. A database would never be stored on an internal hard drive. Almost any commercial database is connected to a disk farm through SAN fabric.

SSDs really shine for OLTP databases. Lots of random IO occurs on these databases (as opposed to data warehouses that use lots of sequential IO).

Normal hard drives are horrible for random IO because of mechanical limitations. Think about trying to switch tracks on a record player thousands of times per second; that's what's happening inside a hard drive under a random IO load. It's amazing mechanical HDDs work as well as they do.

Ummm, most programs are not completely loaded into memory, and inactive pages do get swapped out in favor of active pages. While the most active regions of a program are in memory most of the time, having the whole program in memory is not the general case. Also, DRAM burns ~8W/GB (more if FB-DIMMs), while flash burns only 0.01W/GB. Thus swapping inactive pages to flash allows you to use your DRAM more effectively, improving your performance/W.
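Plugging in the figures quoted above shows just how lopsided the power comparison is (the per-GB numbers are from the post, not any particular datasheet; actual parts vary):

```python
# Rough power comparison for 32 GB of each, using the quoted figures.
dram_w_per_gb = 8.0     # ~8 W/GB for DRAM (more for FB-DIMMs)
flash_w_per_gb = 0.01   # ~0.01 W/GB for flash
gigabytes = 32

print(f"DRAM:  {dram_w_per_gb * gigabytes:.0f} W")    # 256 W
print(f"Flash: {flash_w_per_gb * gigabytes:.2f} W")   # 0.32 W
```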

From a different perspective: you have a datacenter and you are energy c

But the original post was talking about putting the actual programs and OS themselves on the SSD, not the swap file. Obviously it makes sense to put the swap file, and any other frequently and randomly accessed files, on the SSD.

We are going to have two layers, but they'll be deeper in the filesystem than that.

High frequency, low volume operations - metadata journalling, certain database transactions - will go to flash, and low frequency, high volume operations - file transfers, bulk data moves - will go to regular hard drives. SSDs aren't yet all that much faster for bulk data moving, so it makes the most economic sense to put them where they're most needed: Where the IOPs are.
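A toy sketch of that tiering decision (the 128 KB cutoff is purely illustrative, not anything Sun announced; a real filesystem would make this call much deeper in the stack):

```python
# Route each IO to the tier where it's cheapest to serve:
# small random operations to flash (IOPS-bound), large sequential
# transfers to spinning disks (throughput-bound, cheaper per GB).
SMALL_IO_CUTOFF = 128 * 1024   # illustrative threshold

def route_io(size_bytes: int, sequential: bool) -> str:
    if size_bytes <= SMALL_IO_CUTOFF and not sequential:
        return "flash"   # metadata journal, OLTP transactions, etc.
    return "hdd"         # file transfers, bulk data moves

print(route_io(4096, sequential=False))        # flash
print(route_io(64 * 2**20, sequential=True))   # hdd
```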

Back in the day, a single high-performance SCSI drive would sometimes play the same role for a big, cheap, slow array. Then, as now, you'd pay the premium price for the smallest amount of high-IOPs storage that you could get away with.

You forgot the 1000 comments prognosticating about SSDs replacing HDDs permanently "any day now", with the added bravado of saying "I knew this would happen! See, I told you!", with 3000 comments replying "Yeah, but price/performance!", all of which will be replied to with "but price/performance doesn't matter, n00b. Price makes no difference to anyone."

Then, in a fit of wisdom, a few posters, all of whom will be modded down as flamebait, will say "There's room for both and price/performance does matter, at least for now."

I'm just glad there is enough interest in paying for the performance to keep the development moving at a decent clip. Flash really does look like it will have a big advantage for laptop users who are not obsessed with storing weeks' worth of video.

And one person (i.e., me) to mention drum storage from back in the old days (i.e., before I was born), just like every other time it's come up and I've seen it and been bothered to comment, and probably with a link to multicians.org [multicians.org].

Someone once told me that IBM used head-per-track disks for VM paging on some of their early mainframes. That was a performance hot spot that could be addressed by spending money on a very fast storage device.

They are trying to push new technology on their high-paying customers because they can charge a premium while it's a scarce resource. This will drive up production and drive down costs, and soon we'll all be toting massive flash disks all the day

This is just a story about Sun doing something that others have already done for some time now

Really? What other top 5 computer manufacturer has been putting flash drives in SERVERS? I've seen a few laptops, but I haven't seen any used in servers or storage systems. (EMC and a few others have announced plans to do it, but haven't released anything AFAIK)

Also, their "thumper" server has 48 drives in it. Would you want to pay around $1000 per drive to fill that up?

$48k? Chump change. I remember when the company I worked for at the time paid over six figures for a pimped-out server back in the late '90s...

Server hell - I had an SGI Octane on my desk for a while that cost almost $50K. Two years later it had roughly the same performance as a $5K PC. Plus we had an SGI Onyx AND an Origin2000 that cost a cool quarter million each, plus maintenance.

You're confusing two very different sorts of storage. There is bulk data storage. This is a fileserver for home directories, video archives, piles of email, that sort of stuff. This is the market where the 1TB SAS drive thrives. Then there's the database backing store. Almost every customer I've sold to wants a huge number of very fast, very small drives for database backing store. The extra capacity is meaningless, as they have to use so many spindles to get decent IOPS performance. In this area, selling

Samsung will have Multi Level Cells, which are slower (and cheaper). The Single Level cells are faster (up to twice as fast I think), but more expensive.
You can go either way with it, but I think faster (and smaller) drives are more attractive than bigger and slower.
You need to compete against the sequential speed of a 15,000 rpm SCSI drive too (SSD will beat them dead on access speed, but not all workloads are small random reads)

Just mod him funny - then we will all know it's a joke, without having to exercise our critical thinking skills. I know I don't come here to do that, I rely on the moderators.
Here, I've even got mod points, I'll do it for you.... Oops!

Uh, there's a large difference between 'Sun is adding flash storage to most of its servers' and 'SSDs are becoming mainstream in the corporate world.' Just because one vendor sells machines with SSDs doesn't mean that they will be bought or even used in the way Sun intends. You'll know they're becoming mainstream when, two weeks from now, you start getting mails from contract house recruiters perusing your resume on Monster looking for "SSD Storage Engineers" with "10 or more years of experience on Sun equi

Current versions of ZFS have a feature where the ZIL (ZFS Intent Log) can be separated out of a pool's data devices and onto its own disk. Generally, you'd want that disk to be as fast as possible, and these SSDs will be the winner in that respect. Can't wait!


As far as I know, contiguous writing of large chunks of data is slower on flash drives than on plain HDDs. I'm guessing the ZIL is some kind of transactional journal, where all disk writes go before they hit the main storage section of the filesystem? I don't think you'd get much of a speed bonus. SSDs are only really good for random access reads, like OLTP databases.

The benchmarks say something like a 200x performance improvement from putting the ZIL onto an alternate high-performance logging device.

I have been actively researching vendors who will supply this type of device. Currently we're testing with Gigabyte i-RAM cards, connected through a separate SATA interface. (Note: the Gigabyte cards are battery-backed SDRAM, but I won't have lost power for 12 hours, so it's a non-issue for me)

Fusion-IO is a vendor who is making a board for Linux - but as near as I can tell the cards aren't available yet, and when they are - they won't work with Solaris anyway!

The product Neil Perrin did his testing with (the umem/micromemory 5425CN card) doesn't work with current builds of Solaris. Umem is also a pain to work with.. they don't even want to sell the cards (I managed to get some off eBay)

I hope Sun lets me buy these cards separately for my HP proliant servers. Of course if they didn't, this is one thing that might make me consider switching to Sun Hardware! (Hey HP/Dell - are you reading this??)

Given that you can get flash disks that hang off pretty much any common bus used for mass storage(IDE, SATA, SAS, USB, SPI, etc.) "Adding a flash storage option" is pretty much an engineering nonevent, and a very minor logistical task.

If Sun expects to sell a decent number of flash disks, or is looking at making changes to their systems based on the expectation that flash disks will be used, then it is interesting news; but otherwise it just isn't all that dramatic. While flash and HDDs are very different in technical terms, the present incarnations of both technologies are virtually identical from a system integration perspective. This sort of announcement just doesn't mean much at all without some idea of expected volume.

Connected to a PCI-x16 133MHz interface, we're talking about a non-blocking device with up to a 4GB/s connection to the bus -- compared to a SATA or SAS interface which is at best 3gb/s, if you can even find a device interface that fast. (Notice the GB GIGABYTES vs. gb GIGABITS.) Also, the seek times of ~50 microseconds really turn me on. Yeah, I know I could buy a 4gb/s FC RAMSAN unit -- anybody got $50k laying around? Oh wait, I need redundancy, so that's $100k. (and btw it's *STILL* slower than one

Re: "Adding a flash storage option" is pretty much an engineering nonevent, and a very minor logistical task.

You have no idea what you are talking about. Sun customers demand that the product Sun sells them have known reliability properties and that Sun guarantees their products properly interact with each other. It takes a significant amount of resources to do this validation. At the same time SSDs and HDDs react very differently to load and can have all sorts of side effects if the OS/application is not prepared to deal with them.

Sure, this is definitely not a big engineering feat. However, neither of Sun's competitors, IBM nor HP, offers SSDs at the moment. This article indicates that SSDs are gaining momentum (if not attention) in the high-end server market. Suppliers need to offer a product before it can be sold.

People (read: vendors) now frequently refer to flash storage as superior when IOPs are the main issue.

From what I've been able to discern this is actually true only in read-mostly applications and applications where writes are already in neat multiples of the flash erase block size.

If you're doing random small writes your performance is likely to be miserable, because you'll need to erase blocks of flash much larger than the data actually being changed, then rewrite the block with the changed data.
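The write-amplification arithmetic behind that is simple. Assuming an illustrative 128 KB erase block and a single 4 KB random write:

```python
# Changing a few KB can force the controller to erase and rewrite a
# whole flash block, so small random writes pay a large multiplier.
erase_block = 128 * 1024      # illustrative NAND erase-block size
write_size = 4 * 1024         # one 4 KB random write

amplification = erase_block / write_size
print(f"{amplification:.0f}x")   # 32x: 4 KB changed, 128 KB rewritten
```

Actual erase-block sizes and controller cleverness vary, but the multiplier is why random small writes are the worst case described above.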

Some apps, like databases, might not care about this if you're able to get their page size to match or exceed that of the underlying storage medium. Whether or not this is possible depends on the database.

For some other uses a log-oriented file system might help, but those have their own issues.

In general, though, flash storage currently only seems to be exciting for random read-mostly applications, which get a remarkable performance boost so long as the blocks being read are small enough and scattered enough. For larger contiguous reads, hard disks still leave flash in the dust because of their vastly superior raw throughput.

When I last investigated this, the actual write cycle time was also much higher than for a 10k or 15k RPM disk. I think this is the bigger issue, not the fact that you have to read the whole page back first (something which is very fast anyway).

I work at a company that has a few thousand servers running in a few regional data centers. We are looking into SSDs not because of their superior IOPs (though that's a point in their favor vs. HDD performance) but because of their low power consumption and low heat dissipation. When your operations reach a scale where you are using an entire data center, cooling and power become more and more of a cost issue. Right now we are trying to build some hard data on actual savings, but there's lots of spin out there that gives you an idea of what the potential savings could be. Here are a few interesting links; google around for more information, there's plenty to be had:

In the time between now and when SSD becomes cheaper than magnetic storage, might we see a resurgence of RAID 4? RAID 4 stripes data across several disks, but stores parity information all on one disk, rather than distributing the parity bits like RAID 5.

This has benefits for workloads that issue many small randomly located reads and writes: if the requested data size is smaller than the block size, a single disk can service the request. The other disks can independently service other requests, leading to much higher random access bandwidth (though it doesn't help latency).

One of the side effects of this is that the parity disk must be much faster than the data disks, since it must service all requests, to provide the parity info. Here SSD shines, with its quick random access times, but poor sequential performance. Interesting, no?
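For concreteness, here's a minimal illustration of the RAID 4 layout described above: data striped across disks, with the XOR parity concentrated on one dedicated disk, which is also what lets you rebuild a failed data disk:

```python
# RAID 4 in miniature: parity is the XOR of the stripe, stored on one
# dedicated disk; XORing parity with the survivors rebuilds a lost disk.
def parity(stripe_blocks):
    p = bytearray(len(stripe_blocks[0]))
    for block in stripe_blocks:
        for i, byte in enumerate(block):
            p[i] ^= byte
    return bytes(p)

data_disks = [b"\x01\x02", b"\x04\x08", b"\xf0\x0f"]
p = parity(data_disks)                 # lives on the dedicated parity disk

# reconstruct disk 1 after a failure by XORing parity with the survivors
rebuilt = parity([data_disks[0], data_disks[2], p])
assert rebuilt == data_disks[1]
```

Every write must update the parity disk, which is exactly why it becomes the hot spot the post describes, and why fast random access there matters so much.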

At the moment high performance SSDs are still more expensive than RAM. Since a 64 bit processor can address vast amounts of RAM, wouldn't it be even better and cheaper just to have 200GB of RAM rather than 200GB of SSD?

Okay, you would still need a HDD for backing store, but in many server applications involving databases (high performance dynamic web servers for example) a normal RAID can cope with the writes - it's the random reads accessing the DB that cause the bottleneck. Having 200GB of database in RAM with HDDs for backing store would surely be higher performance than SSD.

For things where writes matter like financial transactions, would you want to rely on SSD anyway? Presumably banks have lots of redundancy and multiple storage/backup devices anyway, meaning each transaction is limited by the speed of the slowest storage device.

Since a 64 bit processor can address vast amounts of RAM, wouldn't it be even better and cheaper just to have 200GB of RAM rather than 200GB of SSD?

I don't know that you'd really need a lot of flash memory for this, maybe only a few meg, allowing for spacing out the writes to avoid wear, but flash could let you do write caching when you normally wouldn't trust it, because it won't go away if you lose power.

There is no flash wear myth. If it were a myth, they never would have gone to all that trouble. The whole point behind Static Wear Leveling is to mitigate a very significant and real weakness in the storage medium.

The fact that flash is only really well suited for infrequent writes and frequent non-contiguous reads doesn't bode well for its utility in OLTP applications.