That’s an interesting question, and I figure it’s worth its own top-level entry, as opposed to a reply in the comment stream. One of the interesting design questions in any OS or computer architecture is where the abstraction boundaries should be drawn, and to which side of an abstraction boundary various operations should be pushed. Linus’s argument is that a flash controller can do a better job of wear leveling, including detecting how “worn” a particular flash cell might be (for example, perhaps by looking at the charge levels at an analog level and knowing when the cell was last programmed), and so it doesn’t make sense to try to do wear leveling in a flash file system. Some responsibilities of flash management, such as coalescing newly written blocks into erase blocks to avoid write amplification, can be done either on the SSD or in the file system. For example, by using a log-structured file system, or some other copy-on-write file system, instead of a rewrite-in-place style file system, you can essentially solve the write amplification problem. In some cases, it’s necessary to let additional information leak across the abstraction; for example, the ATA TRIM command is a way for the file system to let the disk know that certain blocks are no longer in use. If too much information needs to be pushed across the abstraction, one way or another, then maybe we need to rethink whether the abstraction barrier is in the right place.
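To make the write amplification point a bit more concrete, here is a back-of-the-envelope sketch. It assumes 4K file system writes and a 128K erase block (both numbers are illustrative assumptions, not measurements of any particular SSD), and it ignores the cost of cleaning the log, so treat the log-structured figure as a best case rather than a prediction.

```python
PAGE_KB = 4            # assumed size of one file system write
ERASE_BLOCK_KB = 128   # assumed erase block size; real devices vary
PAGES_PER_BLOCK = ERASE_BLOCK_KB // PAGE_KB


def in_place_amplification(num_page_writes):
    """Worst case: each scattered 4K update rewrites a whole erase block."""
    host_kb = num_page_writes * PAGE_KB
    flash_kb = num_page_writes * ERASE_BLOCK_KB
    return flash_kb / host_kb


def log_structured_amplification(num_page_writes):
    """Updates are appended to a log and packed into full erase blocks."""
    host_kb = num_page_writes * PAGE_KB
    blocks_written = -(-num_page_writes // PAGES_PER_BLOCK)  # ceiling division
    return blocks_written * ERASE_BLOCK_KB / host_kb


if __name__ == "__main__":
    writes = 3200
    print(in_place_amplification(writes))
    print(log_structured_amplification(writes))
```

Under these assumptions the rewrite-in-place model writes 32 times as much flash as the host asked for, while the log model approaches a factor of 1 once enough writes accumulate to fill whole erase blocks.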

In addition, if the abstraction has been around for a long time, changing it also has costs, which have to be taken into account. The 512-byte sector LBA abstraction has been around a long time, and therefore dislodging it is difficult and costly. For example, the argument that all of these details should be hidden in hardware, because the underlying hardware details change between different generations of SSD, was also used to justify something that has been a complete commercial failure for years if not decades: Object Based Disks.

One of the arguments for OBD’s was that the hard drive has the best knowledge of how and where to store a contiguous stream of bytes, and so perhaps filesystems should not be trying to decide where on disk an inode should be stored, but should instead tell the hard drive, “I have this object, which is 134 kilobytes long; please store it somewhere on the disk”. At least in theory the HDD or SSD could handle all of the details of knowing the best place to store the object on the spinning magnetic media or flash media. In the case of an SSD, it could take into account how worn the flash is and automatically move the object around; in the case of an HDD, the drive could know about (real) cylinder and track boundaries, and store the object in the most efficient way possible, since the drive has intimate knowledge about the low-level details of how data is stored on the disk.

This theory makes a huge amount of sense, but there’s only one problem. Academia and the advanced R&D shops of companies like Seagate et al. have been proposing Object Based Disks for over a decade, with absolutely nothing to show for it. Why? There have been two reasons proposed. One is that OBD vendors were too greedy, and tried to charge too much money for OBD’s. Another explanation is that the interface abstraction for OBD’s was too different, and so there weren’t enough applications, file systems, or OS’s that could take advantage of OBD’s.

Both explanations undoubtedly contributed to the commercial failure of OBD’s, but the question is which is the bigger reason. This is particularly important here because, at least as far as Intel’s SSD strategy is concerned, its advantage is that (modulo implementation shortcomings such as the reported internal LBA remapping table fragmentation problem and the lack of ATA TRIM support) filesystems don’t need to change (much) in order to take advantage of the Intel SSD and get at least decent performance.

However, if the price delta is the stronger reason for the failure of OBD’s, then the X25-M may be in trouble. Currently the 80GB Intel X25-M has a street price of $400, so it costs roughly $5 per gigabyte. “Dumb” MLC SATA SSD’s are available for roughly half the cost per gigabyte (64 GB for $164). So what does the market look like 12-18 months from now? If “dumb” SSD’s are still available at 50% of the cost of “smart” SSD’s, it would probably be worth it to make a copy-on-write style filesystem that attempts to do the wear leveling and write amplification reduction in software. Sure, it’s probably more efficient to do it in hardware, but a 2x price differential might cause people to settle for a cheaper solution even if it isn’t the absolutely best technical choice. On the other hand, if prices drop significantly, and/or “dumb” SSD’s completely disappear from the market, then time spent now optimizing for “dumb” SSD’s will be completely wasted.
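The arithmetic behind the “roughly half” figure can be checked with a few lines. The prices and capacities are the street prices quoted above; the function name is just a label for this sketch.

```python
def dollars_per_gb(price_usd, capacity_gb):
    """Cost per gigabyte for a drive at a given street price."""
    return price_usd / capacity_gb


smart = dollars_per_gb(400, 80)   # 80GB Intel X25-M
dumb = dollars_per_gb(164, 64)    # 64 GB "dumb" MLC SATA SSD
ratio = dumb / smart              # how much cheaper the dumb drive is per GB

if __name__ == "__main__":
    print(f"smart: ${smart:.2f}/GB, dumb: ${dumb:.2f}/GB, ratio: {ratio:.0%}")
```

That works out to $5.00 versus about $2.56 per gigabyte, so the “dumb” drive costs a little over half as much per gigabyte.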

So for Linus to make the proclamation that it’s completely stupid to optimize for “dumb” SSD’s seems to be a bit premature. Market externalities — for example, does Intel have patents that will prevent competing “smart” SSD’s from entering the market and thus forcing price drops? — could radically change the picture. It’s not just a pure technological choice, which is what makes projections and prognostications difficult.

As another example, I don’t know whether or not Intel will issue a firmware update that adds ATA TRIM support to the X25-M, or how long it will take before such SSD’s become available. Until ATA TRIM support becomes available, it will be advantageous to add support in ext4 for a block allocator option that aggressively reuses blocks above all else, and avoids using blocks that have never been allocated or used before, even if it causes more in-file system fragmentation and deeper extent allocation trees. The reason for this is that, at least today, once a block is used by the file system, the X25-M has absolutely no idea whether we still care about the contents of that block, or whether the block has since been released when the file was deleted. However, if 20% of the SSD’s blocks have never been used, the X25-M can use that 20% of the flash for better garbage collection and defragmentation algorithms. And if Intel never releases a firmware update to add ATA TRIM support, then I will be out $400 out of my own pocket for an SSD that lacks this capability, and so adding a block allocator which works around limitations of the X25-M probably makes sense. If it turns out that it takes two years before disks that have ATA TRIM support show up, then it will definitely make sense to add such an optimization. (Hard drive vendors have been historically S-L-O-W to finish standardizing new features and then letting such features enter the market place, so I’m not necessarily holding my breath; after all, the Linux block device layer and file systems have been ready to send ATA TRIM support for about six months; what’s taking the ATA committees and SSD vendors so long? <grin>)
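As a rough illustration of what “aggressively reuses blocks” could mean, here is a toy allocator sketch. It is not ext4’s allocator and the class and method names are hypothetical; it ignores extents, block groups, and fragmentation entirely, and only shows the policy of handing out previously freed blocks before touching any block the SSD has never seen written.

```python
from collections import deque


class ReuseFirstAllocator:
    """Toy allocator that prefers free blocks which have been written before."""

    def __init__(self, total_blocks):
        self.never_used = deque(range(total_blocks))  # never written to the SSD
        self.freed = []                               # released, but already "dirty" on the SSD
        self.in_use = set()

    def allocate(self):
        # Prefer a previously used block so untouched flash stays untouched.
        if self.freed:
            block = self.freed.pop()
        elif self.never_used:
            block = self.never_used.popleft()
        else:
            raise RuntimeError("file system full")
        self.in_use.add(block)
        return block

    def free(self, block):
        self.in_use.remove(block)
        self.freed.append(block)

    def untouched_fraction(self):
        """Fraction of blocks the SSD has never been asked to store."""
        total = len(self.never_used) + len(self.freed) + len(self.in_use)
        return len(self.never_used) / total
```

With this policy, a workload that repeatedly creates and deletes files keeps cycling through the same set of blocks, so the fraction of never-written blocks (the part the SSD can treat as spare area) shrinks only when the live data set actually grows.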

On the other hand, if Intel releases ATA TRIM support next month, then it might not be worth my effort to add such a mount option to ext4. Or maybe Sandisk will make available an ATA TRIM capable SSD that is otherwise competitive with Intel’s, and I get a free sample, but it turns out another optimization on Sandisk SSD’s will give me an extra 10% performance gain under some workloads. Is it worth it in that case? Hard to tell, unless I know whether such a tweak addresses an optimization problem which is fundamental, and whether or not such a tweak will be unnecessary, or perhaps actively unhelpful, in the next generation. As long as SSD manufacturers force us to treat these devices as black boxes, there may be a certain amount of cargo cult science which may be forced upon us file system designers; or I guess I should say, in order to be more academically respectable, “we will be forced to rely more on empirical measurements leading to educated engineering estimations about what the SSD is doing inside the black box”. Heh.

27 thoughts on “Should Filesystems Be Optimized for SSD’s?”

I don’t get why you have to go through the contortions of specifying the number of heads and size of cylinders.

What’s wrong with going into fdisk, using the ‘u’ command to deal in sectors instead of cylinders, and just specifying the 1st sector of your partition to be 1024? Shouldn’t that be enough to ensure alignment?

@1: What’s wrong with going into fdisk, using the ‘u’ command to deal in sectors instead of cylinders, and just specifying the 1st sector of your partition to be 1024? Shouldn’t that be enough to ensure alignment?

You can do that, but it gets annoying for subsequent partitions (I’m not that great at recognizing arbitrary multiples of 1024 in my head), and fdisk will kvetch at you if a partition doesn’t end on a cylinder boundary. It should be safe to ignore the warnings, but I find it more convenient to specify a C/H/S geometry that causes fdisk to work with me, instead of manually specifying everything by hand and having fdisk complain every step of the way.

I suppose if you are only dropping a single partition on the disk, using the ‘u’ command and simply specifying a starting sector of 256 or 512 (you don’t need 1024 unless you really want a 512k alignment just out of paranoia), and accepting the default “end of disk” as the last sector is simpler. Normally though I’m creating more than just the singleton partition on the disk; if I only need the single partition, I’d probably just create a filesystem using /dev/sdb directly, and dispense with the partition table altogether.
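For anyone who (like me) doesn’t want to do the multiples in their head, the sector arithmetic is easy to sketch. This assumes 512-byte sectors; the alignment sizes passed in are whatever you believe the erase block size to be, and the function names are made up for this example.

```python
SECTOR_BYTES = 512  # assumed logical sector size


def is_aligned(start_sector, alignment_kib):
    """True if a partition starting at start_sector begins on an alignment boundary."""
    return (start_sector * SECTOR_BYTES) % (alignment_kib * 1024) == 0


def next_aligned_sector(sector, alignment_kib):
    """Round a sector number up to the next alignment boundary."""
    sectors_per_unit = alignment_kib * 1024 // SECTOR_BYTES
    return -(-sector // sectors_per_unit) * sectors_per_unit  # ceiling division


if __name__ == "__main__":
    for start in (63, 256, 512, 1024):
        print(start, is_aligned(start, 128))
    print(next_aligned_sector(1000000, 128))
```

A start sector of 256 gives 128K alignment, 512 gives 256K, and 1024 gives 512K, while the traditional start sector of 63 is aligned to none of them. The second helper is the part that saves effort for subsequent partitions: feed it the sector after the previous partition’s end and use the result as the next start.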

That’s an interesting point, though one I expected to see in the alignment post instead of here. If you were to use /dev/sdb directly as an LVM PV with an appropriate metadata size, would that completely avoid the mess of cylinders and sectors, which as I understand LVM doesn’t recognize anyway?

I for one prefer not to use traditional partitions except to bootstrap LVM (or ZFS). If SSDs work fine as one big PV, that’s a big win for my kind of usage pattern.

As for aggressively reusing previously used blocks instead of new ones: it would be a nice feature to have for other purposes too, such as running a virtual machine with virtual disks in .qcow2 data files, or using the filesystem on a snapshotted LVM volume. I would be glad to have it at least as a mount option or a superblock parameter.

Panasas has used the Object Based Disk interface to its storage blades
for several years now, and has been working with the standards bodies
to see that to fruition. That is a long row to hoe, but heck, we’re giving
it a shot because we like the OSD abstraction.

I’ll note that OSDv2 has a CLEAR command similar
to the ATA TRIM command you like. OSDv1 was ratified as a command
set for SCSI in 2005, and OSDv2 is out for letter ballot now and should
be ratified this year (2009)

As well as working with the slow moving standards bodies, we’ve been
working to add OSD code to the Linux code base so folks can play with
it and take it in directions we haven’t thought of, or haven’t had time to
pursue. We had to do a bunch of work in the SCSI and block layers of
Linux because they are fairly intertwined, but that work is in (I think)
2.6.29 – I could be off by one. We’ve also got an “exofs” that is a little
file system that sits on top of the OSD command set. You can find out
about this at open-osd.org

My point is that appearance in the Linux code base is perhaps more
important than appearance in a standard, and so we’ve been working
both angles to help foster OSD.

As a distributed file system guy
(I built the Sprite file system in the ’80s, and have been working on the
Panasas distributed file system, which is a commercial success, for
the last several years) I agree with you that having the right abstractions
is important. However, introducing new ones is hard because of the
inertia behind established interfaces. Witness your own issue with
the fixed-size cylinder abstraction that comes from counting sectors
in a track. Disks have had variable numbers of sectors per track depending
on the zone (distance from edge of the platter) for a long time. Yet, the
ancient abstraction of a fixed-sized cylinder managed to interfere with
your desire to optimize to a much newer abstraction, the erase block. Cute.

As for SSD optimization, I think that “yes”, file systems will have to learn
about how their IO patterns interact with the behavior of the SSD. We’ve
already had to learn about how our IO patterns interact with the track
cache, for example. I wish I had better control over that, but I don’t.

However, “no”, I don’t think you want to spend a great deal of time
getting it perfect for device X, because devices Y and Z will appear on
the market soon and may render your approach invalid. A small number
of general things like “4K blocks are a natural write unit” will probably carry
you a lot further than trying to help with the larger erase blocks. Maybe.

Also, in storage systems it is ultimately more about reliability than performance.
Cheaper, MLC (multi-level cell) FLASH has limited write cycles and it is
easier for bits to go bad. I would never use one in a general purpose
file system. A cheap flash device with a dumb controller that can’t even
compensate with basic wear leveling would be a real time bomb, but I
think you’ll find it hard to find a flash device without some sort of wear
leveling built in.

So, for me, the interesting challenge is to figure out how to deploy
SSD in just the right way in very large storage systems where you need
to blend all of main memory caching, SSD, and spinning disk.

You are talking about a future ATA TRIM command, but there seems to be a current CFA ERASE SECTORS command which would work fine to tell the SSD that a sector is unused (whether the SSD wants to erase now or later is not a real problem).
I can’t tell whether your SSD supports CFA; I do not have this model.

In the process of developing an application-layer object storage protocol within the SNIA (Storage Networking Industry Association), I’ve had the occasion to work quite closely with many of the OSD people. The big challenge that I see to adoption of SCSI OSD at large is not the matter of abstraction, but the fact that it interacts very poorly with conventional data protection (i.e. RAID).

OSD can plug in rationally at the inode level of a conventional file system, but may have the problem of interacting badly with applications assuming “conventional disk”. Up the stack from the spindle you see assumptions at each layer based on the characteristics of spinning magnetic media, ones that don’t necessarily hold when replaced with object-disks or, as you point out, SSDs. While at an abstract level it’s clear who’s in the wrong (the higher levels of the stack making assumptions about the underlying layers not guaranteed by the interface), it doesn’t mean that’s the easiest place to fix.

One way to fix this is by amending the interface definition to include the unspoken assumptions of the higher-levels; this is pushing the wear-leveling smarts down into the SSD. Another way to fix is to suggest a different interface altogether, which is why the SNIA XAM standard is so very different from what OSD defines. I, at least, see object storage as relevant at the application level where richer metadata can be expressed than at the SCSI layer.

Essentially this post says that you’re waiting on the SSD makers to hash out what direction they’re going to take with their implementations. If you look around, I think you’ll find that you’re really waist-deep in the SSD implementers’ decision on what that will be.

The vendors want to produce a product that is a combination of some of the following: fast, cheap, and reliable. At the moment, they’re staring at the stone wall of current file systems and deciding that they must do everything in hardware if they are to meet their performance and reliability specs. On the other side of the wall, you’re staring at the implementations out there and saying it may not make sense to implement changes in how things are done until the vendors decide what they will do. The natural evolution of this is that everything will go into the hardware because the vendors can’t rely on the FS and the file systems will remain in “legacy” spinning disk mode because that is what the SSDs will expect for compatibility. As optimization goes, this leads to a local maximum in the performance, value, and reliability world that may be far from the best case.

If you were to decide to do aggressive SSD optimization, the vendors targeting a low price would be happy to throw out some hardware complexity, bringing down costs. Reliability will become cheaper or better. The vendors that target high performance will build on top of the optimized FS for higher performance than they can by optimizing for spinning disk behavior (or poor block alignment and allocation strategies, lack of write combining, and whatever other host assumptions they are compensating for).

Is this the wheel of incarnation rolling over the file system implementer? I don’t think it is. I could be wrong, but I think of it as opening the flood gates. You make a point that if devices go away from “dumb”, the work is wasted. However, if the work was never done, the outcome is heavily tilted towards one conclusion. You can’t drive fast unless you have paved roads. You don’t need paved roads if none of your cars can go fast. If you pave the file system road, the vendors will build the cars. If not, they’ll do the best with the roads they have. As you’ve noted, the lesson of Object Based Disks is that the software was not there. SSDs will undoubtedly succeed, but their future will be decided by file system implementers as much as hardware vendors.

You’re right in that as a filesystem implementer I do have some ability to control what hardware vendors will do. On the other hand, there is also the reality that 85% runs on some variant of Redmond-spawn. So on that basis (unless Windows 7 has a log-structured file system that they’ve kept secret from everyone) it seems likely that hard drive vendors will move towards smarter hard drives to accommodate Windows brain-damage. Unless of course, there are externalities like patents that might keep dumb SSD’s around as a low cost option.

Even if Linux and Apple worked together, the question is whether that would be enough of a market for dumb SSD’s, or some kind of raw flash device. A lot depends on how quickly the price for “smart” SSD’s like Intel’s X25-M falls, obviously.

I feel almost ashamed of the original blog posting now – it was full of grammatical errors. If only I’d known you would look at it…

Thanks for the extended reply (you snipped the actual question at the end though ;). From what you’ve posted here I think the answer is “it depends on whether people have the time to speculate as to whether you should act now”. Certainly kernel folks are starting to make changes to support 4K sector sizes ( http://lkml.org/lkml/2009/2/25/444 ) which while not directly SSD related, seems to be something people believe is going to make a difference. Some SSD vendors seem to be keen to push for dedicated filesystems ( http://www.theregister.co.uk/2009/01/08/sandisk_another_look_at_pssd/ ) so I don’t think the answer is clear cut.

“Until ATA TRIM support becomes available, it will be advantageous to add support in ext4 for a block allocator option that aggressively reuses blocks above all else, and avoids using blocks that have never been allocated or used before”

I think this will also help sparse files used as base for virtual disks for VMs.

“On the other hand, if Intel releases ATA TRIM support next month, then it might not be worth my effort to add such a mount option to ext4.”
Does this mean that TRIM is working NOW with ext4, or is a mount option required?

SSDs with Indilinx Barefoot Controller (OCZ Vertex, Supertalent Ultradrive, both available) should have TRIM support.
An OCZ moderator says that the next firmware (next week or month) has TRIM support, but it is not working with Linux. http://www.ocztechnologyforum.com/forum/showpost.php?p=351938&postcount=35
A German Supertalent distributor says that the Ultradrives with current firmware have TRIM support.

Is there an easy way to check whether TRIM is supported by the SSD (e.g. hdparm, dmesg) and whether the kernel and filesystem are using it?

I’d like to know how the TRIM command will be passed through RAID controllers, such as the Intel Matrix many people use.

I have 2 160GB Gen2 Intel X25-M’s and I’ve been trying to figure out the best way to stripe and subsequently format them for a dual boot Windows 7 / Linux Mint system.

People seem to be testing using low level tools and saying 128K stripe sizes are best because that is the Intel native size, but it’s been hard finding guidance on what block size to use for the filesystem itself.

My first reaction would be to say a 256K block size would split the data evenly, but that seems terribly wasteful on devices with such limited capacity.

Does anyone have any ideas, for regular desktop usage, games, and dual booting from an Intel Matrix setup, what a suggested stripe/filesystem size would be? Possibly one for maximum speed, and another for a reasonable speed/space compromise?

> And if Intel never releases a firmware update to add ATA TRIM support, then
> I will be out $400 out of my own pocket for an SSD that lacks this capability,
> and so adding a block allocator which works around limitations of the X25-M
> probably makes sense. If it turns out that it takes two years before disks
> that have ATA TRIM support show up, then it will definitely make sense to
> add such an optimization. (Hard drive vendors have been historically
> S-L-O-W to finish standardizing new features and then letting such features
> enter the market place, so I’m not necessarily holding my breath; after all,
> the Linux block device layer and file systems have been ready to send
> ATA TRIM support for about six months; what’s taking the ATA committees
> and SSD vendors so long?

Is this indeed working with TRIM enabled SSD’s today? I read somewhere in the OCZ forums, in a post by the hdparm author, that issues showed up when running on real hardware, causing it to be disabled for now.

> On the other hand, if Intel releases ATA TRIM support next month, then it
> might not be worth my effort to add such a mount option to ext4.

So, Ted, will we see such optimization now that Intel has officially left its early adopters out in the cold?

Also, how much of this as well as your previous posts on the subject also applies to USB thumb drives? Any (pun not intended) thumb rules to follow?

But hey, Kingston is now shipping 40 GB SSD’s with Intel 2nd-gen internals for around $100 or so. They are not shipping with TRIM support, but a new firmware is in the works. Getting mine on Monday.

I personally expect SSDs to become dumb in the future, as soon as they stop being pricey products made in USA and start being mass products made in China. The manufacturers will start putting the necessary “intelligence” into a windows driver (for almost no cost per unit) instead of into costly on-device silicon. At least, this is what happened to other devices: WLAN controllers (cf. early Prism chips with the ones used today), Laser printers (GDI), you name it.

One question keeps spinning in my head… Why SCSI for semiconductor memory at all? These memories should (ideally) sit closer to DRAM than to disk…
Look at all the memories which are close to manufacturing (PRAM, FeRAM…); where are we heading…
As for the question of whether the FS should change: why only the filesystem, and not the database?