Posted
by
kdawson
on Monday November 22, 2010 @10:56AM
from the early-days dept.

An anonymous reader writes "It's been known that ZFS is coming to Linux in the form of a native kernel module done by the Lawrence Livermore National Laboratory and KQ Infotech. The ZFS module is still in closed testing on KQ Infotech's side (but LLNL's ZFS code is publicly available), and now Phoronix has tried out the ZFS file-system on Linux and carried out some tests. ZFS on Linux via this native module is much faster than using ZFS-FUSE, but in most areas the Solaris file-system is not nearly as fast as EXT4, Btrfs, or XFS."

Hmmm, well the most obvious feature that ZFS has that ext4 does not is checksumming.

That feature is one reason why ZFS is better (it will tell you if your disk is going bad, and if you have a raid setup, it will go get the good data for you). However, this is also one reason why ZFS is slower... it spends time making sure your data is safe and that it always gives you the correct bits from your disk.
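To make the mechanism concrete, here is a toy sketch of end-to-end checksumming over a two-way mirror, as described above: each block's checksum is stored separately from the block itself, so a silently corrupted read can be detected and repaired from the good copy. This is only an illustration of the idea, not how ZFS actually lays out data on disk; the dict-based "mirrors" are stand-ins for real devices.

```python
import hashlib

def checksum(block: bytes) -> str:
    """Content hash stored apart from the block (ZFS uses its own checksums)."""
    return hashlib.sha256(block).hexdigest()

def read_with_repair(mirror_a: dict, mirror_b: dict, checksums: dict, blkno: int) -> bytes:
    """Return verified data for a block, self-healing a bad copy if possible."""
    expected = checksums[blkno]
    for primary, other in ((mirror_a, mirror_b), (mirror_b, mirror_a)):
        data = primary[blkno]
        if checksum(data) == expected:
            # Self-heal: overwrite the other side's copy if it is bad.
            if checksum(other[blkno]) != expected:
                other[blkno] = data
            return data
    raise IOError(f"block {blkno}: both copies fail checksum")

# Usage: simulate silent corruption ("bit rot") on one side of the mirror.
good = b"important data"
a = {0: good}
b = {0: b"importXnt data"}   # corrupted copy on mirror B
sums = {0: checksum(good)}
assert read_with_repair(a, b, sums, 0) == good
assert b[0] == good          # the bad copy was repaired from the good one
```

A plain hardware RAID mirror cannot do this last step: without an independent checksum it has no way to know which of two differing copies is the good one.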

That single feature is why I run FreeBSD (looking forward to Debian GNU/kFreeBSD!) on my file server in a mirrored RAID configuration. Yes, it is "slower", but I still pull data off that server at over 50MB/sec on my home gigabit LAN! The specs on that server aren't great either... 2GB RAM and an old 1.6GHz single-core Sempron.

I want to build a file server with FreeBSD 9.0 and ZFS, but I want full gigabit speeds. After I got my new Win7 machine, I can copy over SMB at 114MB/sec between my computer and my wife's with only 1.5% CPU. I'm addicted to speed. 10GbE cards/switches have been coming down in price too. Looking into those for a file server.

By the time I get this going (probably in 2 years), DDR4 will be out and 22nm 4-core low-power server CPUs will be out. ZFS + SSD + lots of memory = ftw.

it spends time making sure your data is safe and that it always gives you the correct bits from your disk.

You forgot to mention that you need to set that up in advance, as it doesn't do it by default, and having to come up with settings for any given system is horribly time consuming compared to just having a regular incremental backup running every X hours.

Hard drives already do CRCs in hardware, so that they can detect errors themselves and reread if one is detected, or declare a failed read if repeated reads fail. How often is the extra complication of a software checksum going to help?

God, I'm sick of hearing about this 'checksumming' bullshit from ZFS proponents. What happens if a checksum says your data is corrupted? Yeah, you need a mirror, and if you don't have one then it really doesn't matter how many checksums you perform if you can't recover the data. It's the same for any filesystem you use. It's a nice-to-have, meaning you'll know you have problems sooner, but that is all it does; in these days of cheap disk storage and cheap mirroring and backups people are not going to lik

I only have one computer connected to that server via gigabit, and I'm 90% sure that there is a problem with the gigabit chipset on that client computer. It's an older chipset that doesn't properly support jumbo frames. The hard disks/ZFS aren't the bottleneck.

L2ARC is a HUGE performance improvement for many workloads, it essentially allows you to use faster disks to cache the most frequently used data. If they had combined the SSD and the 7200 RPM SATA drive and benchmarked a real world workload the ZFS implementation would have probably stomped the others because it would have used the SSD for the 'hot' data, the best you can do with btrfs is to place the metadata on the SSD.
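The tiered-read path described here can be sketched in a few lines: a small RAM cache (the ARC) backed by a larger SSD cache (the L2ARC) in front of slow disks, with blocks evicted from RAM spilling to the SSD tier so hot data is mostly served without touching the spinning disks. The sizes and the plain-LRU policy below are simplifications for illustration; the real ARC is an adaptive replacement algorithm, not LRU.

```python
from collections import OrderedDict

class TieredCache:
    """Toy two-tier read cache: RAM (ARC-like) over SSD (L2ARC-like)."""
    def __init__(self, ram_blocks: int, ssd_blocks: int):
        self.ram = OrderedDict()
        self.ssd = OrderedDict()
        self.ram_blocks, self.ssd_blocks = ram_blocks, ssd_blocks
        self.disk_reads = 0

    def read(self, blkno: int) -> str:
        if blkno in self.ram:                  # RAM hit
            self.ram.move_to_end(blkno)
        elif blkno in self.ssd:                # SSD (L2ARC) hit
            self._promote(blkno, self.ssd.pop(blkno))
        else:                                  # miss: go to the slow disks
            self.disk_reads += 1
            self._promote(blkno, f"data-{blkno}")
        return self.ram[blkno]

    def _promote(self, blkno: int, data: str) -> None:
        self.ram[blkno] = data
        if len(self.ram) > self.ram_blocks:    # evict LRU from RAM to SSD
            old, olddata = self.ram.popitem(last=False)
            self.ssd[old] = olddata
            if len(self.ssd) > self.ssd_blocks:
                self.ssd.popitem(last=False)   # drop coldest from SSD

# Usage: a working set larger than RAM but smaller than the SSD tier is
# served entirely from cache after the cold first pass.
cache = TieredCache(ram_blocks=4, ssd_blocks=16)
for _ in range(3):
    for blk in range(10):
        cache.read(blk)
assert cache.disk_reads == 10   # only the first pass touched the disks
```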

Like I said, under most real world workloads the L2ARC will have significant impact. There will always be edge cases and artificial benchmarks that can swamp any cache, but I run a midsized enterprise on an array with only 8GB of cache and it absorbs 99.5% of the write workload and a fair percentage of the non-database read workload so a 64GB+ SSD would just be that much better. With L2ARC you can achieve a high 95-99% IOPS watermark with a small dollar investment, and because the cache is servicing most of

What features does ZFS have that ext4 doesn't? It's a simple question, but you had to act like an ass. Good job.

If I have a bicycle that I ride everywhere, and have never seen nor heard of a car, I would not know what a car could do for me, would I? So if someone comes along and says, "Hey, cars are cool, they are just a little more expensive," I would ask something like: what features does a car have over a bicycle?

It's an interesting question, but not necessarily the right question. I'll explain what I mean. In some cases, a UDP connection with error-handling and retry mechanisms at each end will be faster than a TCP connection. They have the same feature set, but the results are different.

In this case, the question is surely "what features does ZFS have that (some other fs) does not, what is the cost for each feature, and for those features duplicatable outside the FS, what would be the cost to gain those features b

What features does ZFS have that ext4 doesn't? It's a simple question, but you had to act like an ass. Good job.

Jeez, where to start? They're night and day. EXT4 has more in common with FAT32 or UFS than it does ZFS.

It's got a handful of core features, all of which are significant on their own:

* copy-on-write, so you know your data gets committed
* integral RAID-like functionality, integrated with the filesystem. This reduces overhead and eliminates the need for archaic RAID controllers (almost) entirely (complete with their shitty firmware and quirks, etc.) - just the controller, please.
* Due to the above two, elimin

Half of those results will be one discussion forum or another where people who are not smug asses thoughtfully took a moment to answer a person's question.

You had time to post this self-important drivel, surely you have time to answer the question as well -- but you elected for the drivel. And you think that somehow says something about the people asking the question rather than about you?

Thanks for replying like a jerk, that really helps us all out. Nobody is going to simply transition to a new way of doing things just because it's new, they need to know what they'll get from the new way that makes the transition worthwhile.

Forcing any file that requires more than a single block to use a tree of block pointers probably doesn't help. The dnode only has one block pointer, and the block pointer can only point to a single block (no extents). On the plus side, the block size can vary between 512 bytes and 64 KiB per object, so slack space is kept low. If more than a single block is necessary, it creates a tree of block pointers. Each block pointer is 128 bytes in size, so the tree can get deep fairly quickly.
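A quick back-of-envelope using the figures above: with 128-byte block pointers and (as an assumed example) 64 KiB indirect blocks, each indirect block holds 64 KiB / 128 B = 512 pointers, so the number of indirection levels grows with log base 512 of the file's block count. This is a simplified model to show the growth rate, not ZFS's exact on-disk layout.

```python
import math

# 64 KiB indirect blocks of 128-byte pointers (assumed sizes for illustration).
PTRS_PER_INDIRECT = (64 * 1024) // 128   # 512 pointers per indirect block

def indirect_levels(n_blocks: int) -> int:
    """Levels of indirect blocks between the dnode and the data blocks."""
    if n_blocks <= 1:
        return 0                          # dnode points straight at the data
    return math.ceil(math.log(n_blocks, PTRS_PER_INDIRECT))

# A 1 GiB file of 128 KiB blocks is 8192 blocks: two levels of indirection.
assert indirect_levels(1) == 0
assert indirect_levels(512) == 1
assert indirect_levels(8192) == 2
```

The tree stays shallow in absolute terms, but every extra level is an extra dependent read on a cold lookup, which is one plausible source of the overhead being discussed.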

Keeping three copies of almost all file system structures (such as inodes, called dnodes in ZFS) by default can't help either (though they are compressed, of course).

Don't forget the intent log: being able to recover from power-failure issues is great, but unless you use a separate flash ZIL device, it ain't quick ('course, that assumes they are using synced writes).

Um... WTF? Compression is a performance *improvement* and a massive one, at that. The trivial cost in CPU time is offset by the massive reduction in IO time, which is more expensive by far. This has been true since 2000 or even earlier. Modern multi-core CPUs just take the CPU penalty from negligible to nonexistent. Unless your CPU cores are all running at 100%, and possibly even if they are, compression will improve performance.

Note that this is true on a wide variety of filesystems; it's nothing special to these particular ones. Hell, NTFS has had built-in compression for a decade or more. You can improve performance on a Windows system by right-clicking the C: drive and selecting Properties -> Compress this drive. You can do it from the command line using

compact.exe /C /S:C:\ /A

This will compress all files in or under the root of the C drive, including hidden or system files (requires admin, of course) and marks all the directories so that any files written to them will also get compressed.

Snapshots. And I don't just mean any snapshots. Done right, like in ZFS, they are fast. Faster than BSD's UFS snapshots, faster than using LVM's fs-agnostic snapshots. For people who need them, they're great.
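For anyone who hasn't seen it, the workflow looks like this (the pool and dataset names here are made up for the example; the commands themselves are the standard ZFS ones):

```shell
# Take an instant, low-overhead snapshot of a dataset. Creation is
# near-instant because ZFS is copy-on-write and only records a
# reference to the current block tree.
zfs snapshot tank/home@before-upgrade

# List existing snapshots.
zfs list -t snapshot

# Roll the dataset back to the snapshot, discarding later changes
# (works directly only on the most recent snapshot)...
zfs rollback tank/home@before-upgrade

# ...or delete the snapshot once it's no longer needed.
zfs destroy tank/home@before-upgrade
```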

BREAKING NEWS! Journaling filesystems with write caching, including the ever-popular NTFS, are vulnerable to data loss in sudden power failures! Total noobs were left with no idea how to go about fixing the problem.

"If only there were some way to run a check on the file system and perform automatic repairs! OH GOD WHAT DO I DO!?!?!" one commented.

Nigel Tufnel: My RAID arrays are all RAID-11. Look, right across the rack, RAID-11, RAID-11, RAID-11 and...
Marty DiBergi: Oh, I see. And most arrays go up to RAID-10?
Nigel Tufnel: Exactly.
Marty DiBergi: Does that mean it's faster? Is it any faster?
Nigel Tufnel: Well, it's one faster, isn't it? It's not RAID-10. You see, most blokes, you know, will be serving files at RAID-10. You're on RAID-10 here, all the way up, all the way up, all the way up, you're on RAID-10 on your database b

Indeed. The main reason to use ZFS over the other ones, even in cases where the features are the same is that ZFS is more widely available. Admittedly, it's far from universal, but right now it's officially supported in more than one OS. I'm not aware of a filesystem that provides similar functionality to ZFS which is more widely available.

And it's hardly fair to compare a filesystem that's being run in such a convoluted way to one that's able to be much more tightly integrated, especially considering th

The main reason to use ZFS over the other ones, even in cases where the features are the same is that ZFS is more widely available. Admittedly, it's far from universal, but right now it's officially supported in more than one OS. I'm not aware of a filesystem that provides similar functionality to ZFS which is more widely available.

Actually, I've run into this problem, not with ZFS (haven't used it), but with other filesystems, on Linux only. It seems not all filesystems are truly endian-aware, so taking a USB disk created on a big-endian system and moving it to a little-endian system results in a non-working filesystem. I had to actually go and use the original system to mount the disk.

Somewhat annoying if you want to pull a RAID array out of a Linux-running big-endian system in the hopes that you can recover the data... only to find out it was using XFS or another non-endian-friendly FS and basically not be able to get at the data...

The SPL packages provide the Solaris Porting Layer modules for emulating some Solaris primitives in the Linux kernel; as such, this ZFS implementation is not ported to take full advantage of the Linux kernel design.

ZFS is, until BtrFS hits truly enterprise stable, the only FS for large disks, in my opinion. I currently run ZFS on about 10 TB. I never worry about a corrupt file system, never have to fsck it. And snapshots are cheap and fast. I snapshot the entire 10 TB array in about 30 minutes (about 2000 file systems). Then I back up from the snapshot. In other areas of the disk I do hourly snapshotting. Indeed, snapshots are the killer feature of ZFS for me. LVM has snapshots, true, but they are not quick or convenient compared to ZFS. In LVM I can only snapshot to unused space in the volume set. With ZFS you can snapshot as long as you have free space. The integration of volume management and the file system may break a lot of people's ideas of clear separation between layers, but from the admin's point of view it is really nice.

We'll ditch ZFS and Solaris once BtrFS is ready. BtrFS is close, though; should work well for things like home servers, so try it out if you have a large MythTV system.

ZFS is both a filesystem and volume manager. I can't see how anyone would actually prefer the LVM management style to the All-in-One of ZFS, but whatever cocks their pistol.

Also, it's absolutely shocking that Phoronix would run a benchmark in which a Linux component clearly outperformed a roughly equivalent component from another OS. That's not their MO or anything. I'm sure they took great pains to ensure equality, as they always do.

Wrong answer. XFS is extremely prone to data corruption if the system goes down uncleanly for any reason. We may strive for nine nines, but stuff still happens. A power failure on a large XFS volume is almost guaranteed to lead to truncated files and general lost data. Not so on ZFS.

30 minutes? That's insane. An LVM2 snapshot would take seconds. I fail to see how that's not quick, and how "lvcreate -s" is less convenient.

Glad to know LVM is faster though. However, as I stated before it's not conveni

Wrong answer. XFS is extremely prone to data corruption if the system goes down uncleanly for any reason. We may strive for nine nines, but stuff still happens. A power failure on a large XFS volume is almost guaranteed to lead to truncated files and general lost data. Not so on ZFS.

On ZFS, if the system goes down uncleanly you should avoid data corruption so long as every part of the chain from ZFS to your hard drive's platters behaves as ZFS expects and writes data in the order it wants. If it doesn't, you can easily end up with filesystem corruption that can't be repaired without dumping the entire contents of the ZFS pool to external storage, erasing it, and recreating the filesystem from scratch. If you're even more unlucky, the corruption will tickle one of the bugs in ZFS and ev

Wrong answer. XFS is extremely prone to data corruption if the system goes down uncleanly for any reason. We may strive for nine nines, but stuff still happens.

What? That's true of any filesystem, and especially ZFS as practical experience shows. The only way to reliably keep any filesystem going is to keep it on a UPS and talking about 'nine nines' in that context is just laughable.

I keep hearing this shit over and over, mostly on idiot infested Linux distribution and Solaris fanboy forums, and it's ju

In an enterprise you're typically dealing with a SAN. Simply "adding physical volumes" isn't quite so simple. What if your disk array is full? Just tack a USB disk on the server? For us, all our SANs are hardware RAID (we don't use RAID-Z), so adding new volumes, as you suggest, involves buying at least 4 disks (RAID-6), sticking them in the chassis, and creating a hardware volume set. It's quite an undertaking to expand storage. LVM can certainly accommodate our hardware, but would certainly not be

Well, they tested on a single SSD. I have not used ZFS or Btrfs, but I have read a lot about ZFS. This is not really the use case for ZFS. ZFS has many features for things like using an SSD as a cache for the HDDs, RAID-like functions, data compression, and so on. The idea that a simpler, less full-featured file system is faster is no big shock. I would like to see tests with maybe two wan servers, each with say 12 HDDs and an SSD for caching. That is more the use case for ZFS than a workstation with a single SSD.

People like my cousin who run a data center with 10,000+ hard drives and by requirement must have a File System that has been considered stable for at least 5 years. Any data loss is unacceptable. Unless God targets you with His wrath, you have no excuse for any data loss or corruption.

Who would have thought that a first-release beta kernel module would not run as fast or be as reliable as the stable implementations for other operating systems, or the stable filesystems on Linux?

The full release is supposed to be coming out in the first week of January. Given the short time frame, it would seem this is probably closer to the final release than the words "first beta" imply.

Surprises:

Native ZFS beat XFS on several of the benchmarks - XFS is usually a good performer in these kind of tests

Native ZFS does very well on the Threaded IO Test, where it ties for first place.

Btrfs is really bad on the SQLite test, taking 5 times longer than XFS on both 2.6.32 and 2.6.37 (bug?)

XFS recently implemented a new journaling subsystem that should speed up metadata-intensive operations. Once they turn it on, it will gain even more performance (and Ext4 is also getting many scalability improvements)

But that's not particularly helpful. I don't believe that Btrfs is supported beyond Linux at the moment, and neither FreeBSD nor OpenSolaris supports both. Meaning that you're comparing a filesystem that's been grafted onto Linux via FUSE with one that can ultimately be integrated into the Linux kernel.

OpenAFS, which still today provides features unavailable in any other production-ready network filesystem, is a nightmare to use in the real world because of its lack of integration with the mainline kernel. It's licensed under the "IPL", which like the CDDL is free-software/open source but not GPL compatible.

ZFS is very cool, but this approach is doomed to fail. It's much better to devote resources to getting our native filesystems up to speed -- or, ha, into convincing Oracle to relicense.

Personally, I was pretty sure Sun was going to go with relicensing under the GPLv3, which gives strong patent protection and would have put them in the hilarious position of being more-FSF free software than Linux. But with Oracle trying to squeeze the monetary blood from every last shred of good that came from Sun, who knows what's gonna happen.

You mean like how the Nvidia GPU driver has failed because of licensing conflict? I see no reason why the ZFS module can't be distributed in a similar manner to the nvidia driver. I'm sure that rpmfusion could host binary RPMs without problem. They wouldn't be violating the GPL because it would be you the user who taints the kernel.

Of course ZFS on Linux probably isn't aimed at normal users anyway. It's far more likely to be used by people with existing ZFS infrastructure (large fiber-channel arrays, et

ZFS is very cool, but this approach is doomed to fail. It's much better to devote resources to getting our native filesystems up to speed -- or, ha, into convincing Oracle to relicense.

Personally, I was pretty sure Sun was going to go with relicensing under the GPLv3, which gives strong patent protection and would have put them in the hilarious position of being more-FSF free software than Linux. But with Oracle trying to squeeze the monetary blood from every last shred of good that came from Sun, who knows what's gonna happen.

Um, just who do you think is writing BTRFS? http://en.wikipedia.org/wiki/Btrfs [wikipedia.org] I know it's fashionable to knock Oracle every chance you get... but look at the line:

Btrfs, when complete, is expected to offer a feature set comparable to ZFS.[16] btrfs was considered to be a competitor to ZFS. However, Oracle acquired ZFS as part of the Sun Microsystems merger and this did not change their plans for developing btrfs.[17]

Um, just who do you think is writing BTRFS? http://en.wikipedia.org/wiki/Btrfs [wikipedia.org] I know it's fashionable to knock Oracle every chance you get... but look at the line:

As I understand it, Chris Mason brought his btrfs work with him when he started at Oracle, or at least the ideas for it. A kernel hacker of his caliber probably started the job with an agreement of exactly how that was going to go.

Oracle is a big organization; it's not surprising they act in apparently contradictory ways. They've done a reasonable amount of good open source work and made community contributions. But I stand by the statement that it's impossible to make a good prediction as to what Oracle is

I've been through a few filesystem war^Wdramas and stuck with ext?fs the whole time. I liked the addition of journaling, but I'm not sure that I've noticed any of the other "backstage" improvements in day-to-day functioning. Is there really a reason to jump ship as a single-workstation user?

Snapshotting is probably the most compelling feature of either FS for workstation use. Both BTRFS and ZFS are copy-on-write, and they both feature very low overhead, very straightforward snapshotting. That's a feature that almost anybody can utilize.

Aside from that, ZFS features a lot of datacenter-centric goodies that might have some utility on a workstation as well. Realtime (low overhead) compression, realtime (high overhead) deduplication, realtime encryption, easy and fast creation/destruction of files

The ext?fs work well unless they don't. In my admittedly limited experience, I've lost more files on ext2fs than on all other filesystems I've dabbled in combined. Admittedly, I had backups, but any fs that depends upon you having backups to that extent should not be trusted. And while I'm sure the newer ones are better, I'm not sure that I personally trust them, as ext2fs shouldn't have been that easy to corrupt. IIRC that was only a couple years ago, and it should've been both robust and well understood by

For me, journaling was the reason to move from ext2 to ext3. However, for an end user, ZFS has a few cool features that are significant:

1: Deduplication by blocks. For end users, it should save some disk space, not sure how much.
2: File CRCs. This means file corruption is at least detected.
3: RAID-Z. 'Nuff said. No worry about the LVM layer.
4: Filesystem encryption.
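The block-level deduplication in point 1 can be sketched in a few lines: blocks are stored keyed by a hash of their contents, so identical blocks across all files occupy space only once. ZFS's real dedup table (with refcounts, verify-on-collision options, and its own hash choices) is more involved; this only shows the space-saving idea, with an assumed 4 KiB block size.

```python
import hashlib

BLOCK = 4096  # assumed block size for the example

def store(files: dict) -> tuple:
    """Return (per-file lists of block hashes, unique block store)."""
    blocks, index = {}, {}
    for name, data in files.items():
        hashes = []
        for off in range(0, len(data), BLOCK):
            chunk = data[off:off + BLOCK]
            h = hashlib.sha256(chunk).hexdigest()
            blocks.setdefault(h, chunk)   # each unique block stored once
            hashes.append(h)
        index[name] = hashes
    return index, blocks

# Usage: two 3-block files sharing their first two blocks consume only
# 4 unique blocks of storage instead of 6.
file_a = b"A" * BLOCK + b"B" * BLOCK + b"C" * BLOCK
file_b = b"A" * BLOCK + b"B" * BLOCK + b"D" * BLOCK
index, blocks = store({"a": file_a, "b": file_b})
assert len(blocks) == 4
```

The catch, and the reason dedup is often left off, is that the hash-to-block table has to stay resident in memory to be fast, which is part of ZFS's reputation for being RAM-hungry.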

It's OK, runs fairly stable, but it also locks up once in a while and does some aggressive disk I/O. No idea what exactly, probably housekeeping, but it's somewhat irksome, could use some more fine tuning.

The main problem with btrfs right now is that it lacks fsck tools, so in case of havoc there is little chance to recuperate, which is not good for server like systems.

As for ZFS, it's not the tech that's keeping it from Linux but the restrictive licensing. Unless that gets fixed (probably won't happen), it

"As for ZFS, it's not the tech that's keeping it from Linux but the restrictive licensing."

Just to be clear: between the CDDL (ZFS) and the GPL (BTRFS), the GPL is clearly the more restrictive license. BTRFS can probably never be shipped with any major OS other than Linux (at least not in kernel mode), while ZFS has already shipped with a few.

The license restriction is one of Linux's making, not ZFS's. There are arguments for that restriction, but calling the problem one of the CDDL being restrictive is a completely di

It's still under development. But it's already pretty competitive, doing reasonably well in many tests.

And then there's this (on the last page) "Ending out our tests we had the PostMark test where the performance of the ZFS Linux kernel module done by KQ Infotech and the Lawrence Livermore National Laboratories was slaughtered. The disk transaction performance for ZFS on this native Linux kernel module was even worse than using ZFS-FUSE and was almost at half the speed of this test when run under the OpenSolaris-based OpenIndiana distribution."

Ok, maybe someone can disabuse me of a misconception that I have, but: There's no reason that ZFS in the kernel should be slower than a FUSE version. That means there's something wrong. If they figure out what's wrong and fix it, that could very likely affect the results in some or all of the other tests.

ZFS isn't done yet, and it already looks like it might be worth the trade-off for the features ZFS provides. And performance might get somewhat better. This article is good news (though that final benchmark is distressing, especially when you look at the ZFS running on OpenSolaris).

It says: "When KQ Infotech releases these ZFS packages to the public in January and rebases them against a later version of ZFS/Zpool, we will publish more benchmarks."

A lot of it has to do with the reading of compressed data, the huge RAM buffer that ZFS uses, optional commit on writes, and block sizes that match the database pages.

The system scans 3MB of index data, but what it's actually reading to get that off disk is, say, 1MB, which it decompresses on the fly on one of the many cores the database server has. In the end the throughput destroys what I get running non-compressed volumes on EXT4 or XFS on Linu

Has anyone here had experience tuning Postgres on Linux versus Solaris/ZFS ? We're not a huge shop, 8 people running large data-warehouse type applications. We run on a shoestring and don't have a bunch of money to throw at the problem and would be very grateful for any ideas on how to make our database run with comparable performance on Linux. I'm hoping that I'm missing something obvious.

What have you done so far and how are you using Postgres? Mostly reads, mostly writes or some combination of the two? Postgres as it ships is notorious for slow configuration, and many Linux distributions are consistently one major version behind the curve (which is a little annoying as much of the focus of the Postgres people for some time has been improving performance).

We have tables that have as many as 100m records, where Solaris/ZFS seemed to help massively was the big reads for reporting. We have indexed it pretty aggressively, even going so far as to index statements and managed to pull amazing performance, considering the concurrency we see from a free database. (Which for the record, has never given us any problems... postgres has been rock-solid)

for the most part it was running "ok" on linux, but the bump we got from th

Picking on ZFS for being slow when ported to a different OS and running on atypical hardware is like criticizing Stephen Hawking for being a poor juggler. It's focussing on the wrong thing.
The goals of ZFS are, in no particular order:
- Scalability to enormous numbers of devices
- Highly assured data integrity via checksumming
- Fault tolerance via redundancy
- Manageability/usability features (e.g., snapshots) that conventional file systems simply don't have
Oh, and if it's fast, well, that's gravy.

The speed can be achieved with more/better hardware. A filesystem shouldn't have 'fast' or 'faster than ye' as its primary focus anyway. If it's very fast but not 100% trustworthy, it's not a good file system (e.g. ReiserFS).

Some features that make ZFS a bit slower were thought up by people who have years of experience with large SANs and other storage solutions. Writing metadata multiple times over different spindles might seem overkill to most, but that is until you lose N+1 spindles (or just get r/w errors on

Picking on ZFS for being slow when ported to a different OS and running on atypical hardware

How is he picking? He's just measuring the file system performance compared to others on a specific OS.

It's focussing on the wrong thing.

I don't think it is, this person wanted to measure performance on Linux, not compare features and he got what he was testing. I would imagine there are plenty of people who want to know how well it performs - regardless of features - in comparison to other filesystems.

Since ZFS is doing metadata replication, running the tests on a single disk is going to punish ZFS performance much more than other filesystems. It would be much more interesting to run a benchmark with an array of 6 or 8 disks with RAID-Z2, with ZFS managing the disks directly, and XFS/btrfs/ext4 running on MD RAID-6 + LVM. Next, run a test that creates a snapshot in the middle of running some long benchmark and see what the performance difference is before/after.
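As a sketch, the two stacks being proposed for comparison would be set up roughly like this (device names are placeholders; a real deployment would use stable disk IDs rather than /dev/sdX names):

```shell
# ZFS managing six disks directly as a double-parity RAID-Z2 pool.
zpool create tank raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg

# The conventional Linux stack: MD RAID-6, then LVM, then a filesystem.
mdadm --create /dev/md0 --level=6 --raid-devices=6 \
      /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg
pvcreate /dev/md0
vgcreate vg0 /dev/md0
lvcreate -n data -l 100%FREE vg0
mkfs.xfs /dev/vg0/data
```

Note the difference in layering: ZFS sees the raw disks and can place redundant metadata across spindles, while XFS on top of MD+LVM only ever sees one logical block device.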

The consistency guarantees provided by the tested filesystems differ significantly. Most (all?) aside from ZFS only journal metadata by default. All data and metadata written to ZFS is always consistent on disk. You won't notice the difference until you crash, and even then you still might not, but it will certainly show up in the benchmarks.

ZFS is not a lightweight filesystem, that is a fact. The 128-bit addresses, 256-bit checksums, compression, and two or three way replicated metadata don't come for