Technology Lab —

Ext4 filesystem hits Android, no need to fear data loss

Google has adopted Ext4 for Android and is shipping the filesystem on its …

Google's new Nexus S smartphone is the first Android device to use the Ext4 filesystem. The company published a statement on the official Android developer blog earlier this month to discuss how adoption of Ext4 on Android will impact third-party application developers.

In a follow-up post last week, Ext4 developer Ted Ts'o commented on the transition and offered some further clarification regarding concerns about fsync data loss issues, which he says pose minimal risk on Android due to the higher level of quality assurance testing.

An expert on filesystem development, Ts'o played a key role in developing Ext4, the current generation of the Linux kernel's standard filesystem. He was hired earlier this year by Google when the search giant was transitioning its server storage infrastructure from Ext2 to Ext4. He says that he didn't influence the decision to use Ext4 in Android, but provided some advice and guidance to the Android team after the decision was made.

Most Android devices currently use YAFFS, a lightweight filesystem that is optimized for flash storage and is commonly used in mobile and embedded devices. The problem with YAFFS, Ts'o explained in his blog entry, is that it is single-threaded and would likely "have been a bottleneck on dual-core systems." Concurrency will be important on next-generation Android devices that use multi-core ARM processors. We expect to see dual-core Android devices, including tablets, announced as early as CES.

As Tim Bray explained on the Android developer blog, applications that use the platform's high-level storage abstractions will generally not have to worry about the transition. Developers who are accessing the filesystem directly will have to be mindful about Ext4's buffering behavior and make sure that the data is actually reaching persistent storage in a timely manner so that it won't be lost in the event of a system failure.

Ts'o says that there isn't much need for concern. Google and the handset makers will catch platform-level filesystem reliability issues, ensuring that the high-level storage APIs are safe. He also believes that the significant amount of product QA conducted by the vendors will reduce the risk of random crashes. Short of users yanking out the battery, he says, it's unlikely that Android devices will routinely run into the kind of system failure that causes data loss for applications that don't properly use fsync.

Ts'o also addressed the question of why Google would pass on Oracle's Btrfs, which is expected to eventually displace Ext4. Btrfs simply isn't mature enough yet for use in production. Canonical considered using Btrfs as the default filesystem in Ubuntu 10.10, but postponed adoption after similarly deciding that it needed more time in the oven. Nokia and Intel have adopted Btrfs for MeeGo, though it's unclear if they will stick with that decision when MeeGo ships on actual consumer devices. It's clear that Ext4 still has an important role to play while the issues in Btrfs are being ironed out.

43 Reader Comments


"Google's new Nexus S smartphone is the first Android device to use the Ext4 filesystem."

Aha! Not strictly true. Galaxy S owners using the Voodoo lagfix have been using ext4 for quite some time now. Perhaps Samsung's engineers were impressed with supercurio's work in bringing the file system to Android.

Ts'o says that there isn't much need for concern. Google and the handset makers will catch platform-level filesystem reliability issues, ensuring that the high-level storage APIs are safe. He also believes that the significant amount of product QA conducted by the vendors will reduce the risk of random crashes. Short of users yanking out the battery, he says, it's unlikely that Android devices will routinely run into the kind of system failure that causes data loss for applications that don't properly use fsync.

I really hope I'm reading that wrong. On an embedded device, which is what a phone should be designed as, you should take a conservative approach to the filesystem. It should not be up to individual developers to make sure data is flushed in a timely manner by default; rather, it should be up to the developer to override the handling in those few, rare circumstances when they must.

Perhaps Samsung's engineers were impressed with supercurio's work in bringing the file system to Android.

Samsung? The company that gave us the wonders of RFS, a FAT16-derived (and somehow worse) filesystem for use on a Linux-powered device?

topham wrote:

Quote:

Ts'o says that there isn't much need for concern. Google and the handset makers will catch platform-level filesystem reliability issues, ensuring that the high-level storage APIs are safe. He also believes that the significant amount of product QA conducted by the vendors will reduce the risk of random crashes. Short of users yanking out the battery, he says, it's unlikely that Android devices will routinely run into the kind of system failure that causes data loss for applications that don't properly use fsync.

I really hope I'm reading that wrong. On an embedded device, which is what a phone should be designed as, you should take a conservative approach to the filesystem. It should not be up to individual developers to make sure data is flushed in a timely manner by default; rather, it should be up to the developer to override the handling in those few, rare circumstances when they must.

The ext4 developers have had an attitude for quite a while that they don't give a shit about fsync causing data loss, as "it's the app writers' fault" (never mind that doing it 'right' is just a pain, and doesn't need to be done for ext3).

Ts'o says that there isn't much need for concern. Google and the handset makers will catch platform-level filesystem reliability issues, ensuring that the high-level storage APIs are safe. He also believes that the significant amount of product QA conducted by the vendors will reduce the risk of random crashes. Short of users yanking out the battery, he says, it's unlikely that Android devices will routinely run into the kind of system failure that causes data loss for applications that don't properly use fsync.

I really hope I'm reading that wrong. On an embedded device, which is what a phone should be designed as, you should take a conservative approach to the filesystem. It should not be up to individual developers to make sure data is flushed in a timely manner by default; rather, it should be up to the developer to override the handling in those few, rare circumstances when they must.

I'd expect the Android database API will hide syncing almost entirely from the developer. Many (most?) apps don't need raw filesystem access, so those cases are taken care of. If the dev does want raw filesystem access (e.g., storing stuff on the SD card), then that's one of the few, rare cases where they'll have to do it themselves, although google could still probably hide syncing in the API for many of those cases too. Finally, if the app is using native code, then the dev is obviously on their own.

Ts'o says that there isn't much need for concern. Google and the handset makers will catch platform-level filesystem reliability issues, ensuring that the high-level storage APIs are safe. He also believes that the significant amount of product QA conducted by the vendors will reduce the risk of random crashes. Short of users yanking out the battery, he says, it's unlikely that Android devices will routinely run into the kind of system failure that causes data loss for applications that don't properly use fsync.

I really hope I'm reading that wrong. On an embedded device, which is what a phone should be designed as, you should take a conservative approach to the filesystem. It should not be up to individual developers to make sure data is flushed in a timely manner by default; rather, it should be up to the developer to override the handling in those few, rare circumstances when they must.

It is possible to disable delayed allocation (mount -o nodelalloc), which (I think) caused much of the data loss uproar.

The ext4 developers have had an attitude for quite a while that they don't give a shit about fsync causing data loss, as "it's the app writers' fault" (never mind that doing it 'right' is just a pain, and doesn't need to be done for ext3).

If you write code that violates an API's contract, your program has a bug even if it doesn't manifest itself now and you're responsible for problems that occur in the future due to that bug.

Currently I have only one strong gripe with ext4. For the time being it seems to be journaling XOR SSD TRIM support: you can have one or the other, but not both. Fortunately, that will not affect Android for now.

The ext4 developers have had an attitude for quite a while that they don't give a shit about fsync causing data loss, as "it's the app writers' fault" (never mind that doing it 'right' is just a pain, and doesn't need to be done for ext3).

Well, ext4 was only as broken as XFS, JFS, and pretty much every other filesystem that uses delayed allocation.

This one's so done, the carcass is in the bin and the forks are in the dishwasher by now.

Ext4 revealed that some programmers are crappy, and that Linux really needs fsync to actually work on the specified file rather than flush all pending disk operations, so that developers will actually call fsync.

It was fixed in the meantime with some code which makes ext4 slower when compared with XFS or JFS, but safer.

Meanwhile, the programs that ext4 showed as broken are still broken if you use a high-performance filesystem, and we still don't have an fsync call which operates on just one file efficiently.

Frankly, fixing fsync would have been better. But this is the real world, and we take what we can get.

Speaking of the real world... This is the FS on a phone. A phone on which most apps will never see or care about the FS, as they run in a sandbox and have no real access to such things.

It's nice to see that phones are rapidly becoming this powerful, but really, for all but the most technical of users, it matters not. Take this article as a sign of the pace of development, rather than as a place to have a filesystem urination contest, maybe?

The ext4 developers have had an attitude for quite a while that they don't give a shit about fsync causing data loss, as "it's the app writers' fault" (never mind that doing it 'right' is just a pain, and doesn't need to be done for ext3).

If you write code that violates an API's contract, your program has a bug even if it doesn't manifest itself now and you're responsible for problems that occur in the future due to that bug.

Except ext3, the predominant Linux filesystem, violates that API contract. The ext3 fsync implementation ignores the file descriptor argument, and will sync ALL outstanding data, which can cause very undesirable behavior (hence the uproar that occurred when Firefox started using fsync on Linux). And this is an improvement over the old behavior, which just ignored fsync entirely.

Are application developers supposed to be working around kernel "bugs"? Do they need to check what filesystem the file exists on, and then branch to filesystem-specific code to handle writing out data?

@5: The typical situation is “I’m using my phone and by accident I let it fall to the ground, and the battery escapes”.

Good point. I’ve had pretty good luck with my Nexus One (despite my butter fingers), but the battery has escaped once or twice. So yes, it’s something to be concerned about.

I should check whether Gingerbread forces a sync(2) when the system suspends and the screen is turned off. Otherwise, the data could be left buffered for a long time and then if the device is dropped, data written potentially hours earlier might get lost if the application is badly written. [...]

YAFFS goes to great lengths to support wear leveling, flash errors, etc. I presume Ext4 just uses a generic block layer - isn't that going to significantly affect reliability?

No. File system wear-leveling hasn't been used in CE devices in a very long time, so the fact that YAFFS supports it doesn't really matter. Instead, modern NAND is either hardware wear-leveled (as in SD) or software wear-leveled at a level below the file system (typically using a flash translation layer, or FTL). File-system-based leveling largely went away in CE devices because all of them must support FAT32 somewhere due to Windows, and if you're including support for FAT32, then you obviously have either hardware or FTL-based leveling anyway.

YAFFS goes to great lengths to support wear leveling, flash errors, etc. I presume Ext4 just uses a generic block layer - isn't that going to significantly affect reliability?

No. File system wear-leveling hasn't been used in CE devices in a very long time, so the fact that YAFFS supports it doesn't really matter. Instead, modern NAND is either hardware wear-leveled (as in SD) or software wear-leveled at a level below the file system (typically using a flash translation layer, or FTL). File-system-based leveling largely went away in CE devices because all of them must support FAT32 somewhere due to Windows, and if you're including support for FAT32, then you obviously have either hardware or FTL-based leveling anyway.

Sorta.

With an SD card you are not using the flash directly as a memory technology device. You're using the SD card as block storage.

How this works is that the SD or CF card provides hardware emulation of block storage (FTL or whatever you want to call it) and logically maps the flash it uses internally. The OS and software will never see it as a 'real' flash device. Instead it looks and behaves much like a hard drive. Since the OS cannot really see what is going on inside the flash device, OS-level wear leveling is kinda pointless. You have to depend on the hardware to 'do the right thing'. Even if you try wear-leveling schemes on an SD device, you have no idea whether the flash device will actually benefit from them.

You can tell this because raw flash devices are incompatible with file systems designed for block devices. Without this block-device emulation in the hardware, you would not be able to directly run file systems like FAT32 or Ext4.

Being able to access flash as a flash device is still really common on embedded systems. To use file systems like FAT or Ext4 there, the OS has to use special drivers to map the 'block device' used by the FS onto the MTD device. Then you can take advantage of smart OS wear-leveling schemes 'underneath' the file system. The real problem with a lot of things like YAFFS is that they are designed specifically for small amounts of flash on systems with very little RAM. This requires a lot of design compromises, and they don't scale upward in size very well: at larger sizes they will be slow, poorly tested, and inefficient.

I believe phones with 2GB+ of onboard storage are probably not going to use raw MTD devices anyway. Instead they will use an SD-like device that is soldered directly to the board, so the OS will see them as block devices, as you indicated.

Can someone explain to me the advantages of ext4 over ext2 for a smartphone?

Ext4 is extent-based, offers better data protection, and won't require an fsck every time you run out of battery power or force the device to reboot. When it does require an fsck, it will be able to do it in a fraction of the time.

Quote:

I'd think that the log would both slow it down and make the flash more fragile, with the primary benefit being no need for fsck on most non-clean boots.

The log makes it more robust since it will help a lot to protect the file system from corruption.

With phones, being shut off roughly is a way of life. People pull batteries, force power off, run them till the battery dies abruptly, etc etc.

Plus we are starting to see phones shipping with 16-32GB worth of storage. During the ext2 era that would have been a lot of storage.

Ext2 is really quite ancient, and the old wives' tales about it being preferable on flash devices due to the lack of a journal are really not true. There is some logic behind them, but it really is not something you should worry about. If it's a problem, it has to do with shitty hardware and shitty flash devices rather than the file system, and you're going to run into other problems with that flash no matter what.

Ext4 revealed that some programmers are crappy, and that Linux really needs fsync to actually work on the specified file rather than flush all pending disk operations, so that developers will actually call fsync.

It was fixed in the meantime with some code which makes ext4 slower when compared with XFS or JFS, but safer.

Meanwhile, the programs that ext4 showed as broken are still broken if you use a high-performance filesystem, and we still don't have an fsync call which operates on just one file efficiently.

Frankly, fixing fsync would have been better. But this is the real world, and we take what we can get.

There's nothing wrong with fsync() on Linux, in general. The problem is fsync() on ext3; on ext4, XFS, BTRFS and so on it's fine. Unfortunately, there's apparently some kind of fundamental design limitations in ext3 that makes it very hard if not impossible to fix fsync() on that fs.

Also, AFAIK XFS and BTRFS also implement the similar fixes that ext4 nowadays does with the (default) auto_da_alloc mount option.

Because their setup is such that they don't want to pay the overhead of journaling. Since all data is replicated to multiple machines, in the rare event of a crash I guess they just reinstall the machine and get the data from other nodes.

Except ext3, the predominant Linux filesystem, violates that API contract. The ext3 fsync implementation ignores the file descriptor argument, and will sync ALL outstanding data, which can cause very undesirable behavior (hence the uproar that occurred when Firefox started using fsync on Linux).

Yes, ext3 fsync() behavior sucks. News at 11.

Quote:

And this is an improvement over the old behavior, which just ignored fsync entirely.

AFAIK this has never been true.

Quote:

Are application developers supposed to be working around kernel "bugs"?

Ideally, no (duh..). In the real world, sometimes they unfortunately have to do that.

Here's an honest question. I have been programming C++ at a basic level for years, but have mostly stuck to the standard library and I have certainly never written anything for a phone (although I've done very low level embedded programming with C).

I assume the ostream flush method and the stdio fflush function do the same thing fsync does on an ext3 filesystem. Am I wrong in thinking that they will achieve the same effect on ext4 as well? If I use fflush on ext4, am I getting the same behavior that I would if I was using unistd.h instead of stdio.h?

I just recently installed Ubuntu in a VM and this is my first experience with ext4 at all.

I assume the ostream flush method and the stdio fflush function do the same thing fsync does on an ext3 filesystem.

Not so. The fflush function causes the buffers in the local process to be written out via the OS I/O routines. This generally makes the changes visible to other processes, but it may still be cached by the OS and not yet written out to the physical hardware. The fsync call is to guarantee that the OS-level write cache has been flushed out to the physical hardware. These things normally happen automatically, you only need to call them explicitly if you want to ensure that they have been done before your program does anything else.

Not so. The fflush function causes the buffers in the local process to be written out via the OS I/O routines. This generally makes the changes visible to other processes, but it may still be cached by the OS and not yet written out to the physical hardware. The fsync call is to guarantee that the OS-level write cache has been flushed out to the physical hardware. These things normally happen automatically, you only need to call them explicitly if you want to ensure that they have been done before your program does anything else.

Ahh, thanks. That makes more sense. I've never written anything that required this level of control over the filesystem, so my experience with those calls is limited.

I would like to play around with some phone development and I have an iPhone, so iOS is probably what I will look at sooner, but I would like to be familiar with Android at some point too.

It is possible to disable delayed allocation (mount -o nodelalloc), which (I think) caused much of the data loss uproar.

Funny you should mention that. Ext4 ran smoothly on my HDD with 2.6.35, but not so much on my SSD: I have had errors=remount-ro on forever, and now that I switched to the SSD, my /home partition has shifted into ro mode a couple of times after delalloc failures. What truly mystifies me is why Linux will try a lot to read a bad HDD sector, but one little delalloc failure and it spews out "I LOST DATA!" Um, retry? Surely if you're still allocating and printing inode numbers to my log, you also have everything you need to put it somewhere else. (And if we had online fsck, start one...)

At any rate, I stuffed nodelalloc into the fstab for the ssd volumes and it's been running okay since. Time will tell if it's a permanent fix. It would definitely not be my choice in filesystem for a phone, though, given its observed issues on my desktop.

What truly mystifies me is why linux will try a lot to read a bad hdd sector, but one little delalloc failure and it spews out "I LOST DATA!" Um, retry?

The only way to make a failed read safely succeed is to repeatedly re-read in the hope that it starts working. The only way to make a failed write safely succeed is to get a new disk to write to.

Hence the difference in strategy.

Here's another question from someone not in the know.

In the case of the flash drive, what is causing the delayed allocation error? blueshifter mentions ext4's inability to handle this smoothly. I am guessing that the implicit assumption held by blueshifter is that the error is indicative of a transient problem; perhaps the write buffer can't write within a specified period of time? DrPizza's comment seems to imply that such an error would not normally occur unless the storage medium is permanently unavailable for some reason.

I don't know the specifics of ext4; it could be a bug causing a spurious, transient write failure for all I know. But in general, filesystems should treat underlying read errors differently from write errors, so the different behaviour blueshifter was describing does, fundamentally, make sense.

Plus btrfs needs a fsck that actually fixes errors, because they do happen...

BTRFS needs so much more than FSCK that it's not funny. Almost every week there's a new critical bug reported, or a new panic/crash/hang. Even though this FS is in the latest Linux kernel, it's FAR from being ready for prime time.

To be accurate, JHFS+ (i.e. journaled HFS+). (Which, out of interest, is NOT threaded, not even across multiple volumes. As far as I can tell there is a single file system lock for anything that touches file systems, not even a per-volume lock. This can easily become a real bottleneck on multicore Macs, though claiming it to be a bottleneck on next year's phones strikes me as unlikely. I suspect/hope that, at the very least, 10.7 will have per-volume locks, which should at least eliminate the common OS X effect of the flash drive being blocked and slow because of simultaneous operations on a magnetic drive. Long term, there has to be something beyond JHFS+ that is better multi-core capable and offers better end-to-end protection. If I had to guess, my money would be on Apple adopting btrfs, but honestly, who knows? Maybe Jobs will negotiate some sort of deal with Oracle under which it makes sense for Apple to get back on the ZFS wagon?)

If I had to guess, my money would be on Apple adopting btrfs, but honestly, who knows?

Extremely unlikely. Btrfs is part of the Linux kernel and thus is licensed under the GPLv2. Native Linux file systems are deeply integrated into the kernel and share major functionality with many other subsystems. This sharing of functionality is one of the benefits of being open source and of Linux having no stable internal ABI: if another driver does what you want, you can just use the functionality it provides in your own driver.

What I am trying to say is that if you want to port a Linux FS to your OS, you're going to have to port major portions of the Linux kernel along with it.

This is what Sun Microsystems found out about Lustre....

Quote:

Maybe Jobs will negotiate some sort of deal with Oracle under which it makes sense for Apple to get back on the ZFS wagon?

ZFS is a much more acceptable license for Apple as it is similar to how Mozilla operates... you have to keep the code open source, but you can integrate it into your closed source project. Possibly much of the work of porting it is already done by FreeBSD, since Apple's OS X kernel is based on a major portion of FreeBSD's kernel and uses the BSD VFS for providing a POSIX-ish interface for the Unix side of things in OS X.

Apple had a ZFS project going, and IIRC the Solaris ZFS folks implemented the functionality necessary to support unique OS X features like resource forks and case sensitivity.

ZFS stands for "Zettabyte File System". It was designed and implemented by a team at Sun Microsystems led by Jeff Bonwick. Support has been included in Darwin 9, but as of Snow Leopard, any mentions of ZFS have been removed by Apple. We hope to use more ZFS and its features in PureDarwin if possible.

Update: The Mac port of ZFS is no longer hosted on http://zfs.macosforge.org/ but the zfs-macos project is continuing to host the codebase.

Makes sense from a commercial perspective. File systems are not a big selling point for consumer electronics, and on desktops they don't really have much impact. They are also expensive to develop and take years to mature, so it's understandable why Apple would be happy to stick with a file system from the FAT32 era. Having gobs of memory for a nice huge file system cache is all you really need for good performance for 99% of people out there on the desktop.

YAFFS goes to great lengths to support wear leveling, flash errors, etc. I presume Ext4 just uses a generic block layer - isn't that going to significantly affect reliability?

No. File system wear-leveling hasn't been used in CE devices in a very long time, so the fact that YAFFS supports it doesn't really matter. Instead, modern NAND is either hardware wear-leveled (as in SD) or software wear-leveled at a level below the file system (typically using a flash translation layer, or FTL). File-system-based leveling largely went away in CE devices because all of them must support FAT32 somewhere due to Windows, and if you're including support for FAT32, then you obviously have either hardware or FTL-based leveling anyway.

Sorta.

What do you mean sorta? You just restated what you quoted.

drag wrote:

Being able to access flash as flash devices is still really common on embedded systems.

I don't really care if there are still lots of flash file systems sitting out there on factory floors, in aircraft control systems, etc. We're talking about phones here, not industrial equipment. As I said above, CE flash devices have to be able to talk FAT somewhere, which means they're going to have some kind of FTL.

drag wrote:

I believe with phone systems that have 2GB+ storage onboard then they are probably not going to use raw MTD devices anyways. Instead they will use a SD-like device that is soldered directly to the board. So the OS will see them like block devices as you indicated.

Some do indeed use SD devices soldered directly to the board. Most though do it all in software using an FTL in their NAND driver. If you buy an ARM chip that supports NAND flash, it'll almost certainly come with a license for one of the common FTLs such as Samsung's Whimory (used on ipods, wince devices, and some android phones) or one of the Chinese equivalents. Unless you're interfacing with your NAND chips by bit banging a GPIO line, you're probably going to get an FTL along with the chip these days.

I don't really care if theres still lots of flash file systems sitting out there in factory floors, air craft control systesms, etc. We're talking about phones here, not industrial equipment.

And they're not all that different. Flash file systems are used almost all the time for the OneNAND devices that host the OS in most phones these days: clunkers like JFFS2 and YAFFS, or newer ones like UBIFS, all designed around raw NAND devices.

Quote:

As I said above, CE flash devices have to be able to talk FAT somewhere, which means they're going to have some kind of FTL.

Depends on what they're exposing. In most cases, on Linux devices it's via an abstraction layer that translates FAT commands into native filesystem commands. USB Gadget Filesystem, I believe. My N900 uses it to export both the FAT32 partition on the internal eMMC and the SD card.

Quote:

Some do indeed use SD devices soldered directly to the board. Most though do it all in software using an FTL in their NAND driver.

Again, it depends on the OS and the device. Most devices with high volume storage, like my N900, use an eMMC, which just turns NAND into an SD card that can't be removed. In fact, if you see a device with more storage than the highest capacity OneNAND devices out, it's almost definitely an eMMC device.