ZFSonLinux on Ubuntu 13.04 build log

Just wanted to post this here in case anyone is interested. It was straightforward.

We got a NetApp E5400 with 60x 3TB SATA drives, plus a new Intel 1U whitebox with two ~120GB SSDs. /dev/sda is a regular Ubuntu 13.04 install. /dev/sdb I split into a 4GB partition and the rest (~105GB).

Install the ZFSonLinux PPA and "aptitude install ubuntu-zfs".
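For reference, the whole thing is just a few commands (the PPA name is the ZoL project's stable PPA as of the 13.04 era; adjust if yours differs):

Code:

# may need software-properties first: apt-get install python-software-properties
sudo add-apt-repository ppa:zfs-native/stable
sudo apt-get update
sudo aptitude install ubuntu-zfs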

Install multipath-tools: we have 2x FC HBAs, so we have redundant HBAs for the redundant FC NetApp controllers. The default 'multipath -ll' output is doing the right thing with the NetApp rdac. Formatted the NetApp disks into 10-disk RAID-6 groups (8+2).

Doing some 'dd' runs on the multipath devices shows the right thing: two active paths and two backup paths, depending on which controller is the 'home' controller for the LUN.
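Pool creation over the multipath LUNs is nothing special; roughly this (the mapper names are hypothetical, and the number of LUNs depends on how you carve up the array):

Code:

# one top-level vdev per RAID-6 LUN, addressed via the multipath devices
zpool create -o ashift=12 tank \
    /dev/mapper/mpatha /dev/mapper/mpathb /dev/mapper/mpathc \
    /dev/mapper/mpathd /dev/mapper/mpathe /dev/mapper/mpathf
zpool status tank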

One word of warning: ZFS cannot guarantee data integrity when it's running on top of a RAID array.

This is not to say that ZFS is less safe than anything else on top of a bunch of RAID6s - just that it isn't necessarily as safe as it COULD be, either. If it can't guarantee when data is or is not written to the disk, it can't guarantee filesystem integrity. (Neither can any other filesystem.) And since ZFS can't see the individual disks in your SAN, it can't checksum the blocks written to individual disks - it can only checksum blocks written to the RAID6 array. Further, since you don't have any parity at the ZFS level (i.e. your pool was built without RAIDZ), if ZFS does detect corruption, it has no way of repairing it - all it can say is "sorry boss, you got bad data".

Can the NetApp even export the disks as raw JBOD? I'm talking down to the point where you can run smartctl on each disk and pull the actual drive specs. Assuming the NetApp can export each drive as a single-disk RAID 0 with its own proprietary block exporting scheme, and you created a ZFS raidz2 the same way you did with the NetApp, would that allow ZFS to fix corruption like Jim mentioned above? It sounds like when using ZFS it would be better to just get vanilla SAS HBAs and string a bunch of generic Supermicro SAS JBODs together. You could probably get triple the amount of storage for the price of the NetApp, too. Linux using smartd along with ZFS could probably do everything the NetApp offers.

I don't understand why you're doing this when NetApp offers many of the features of ZFS. Your NetApp box is also a heck of a lot more redundant than the white box you just put in front of it. What is the goal of the project?

This NetApp is not a "regular NetApp", but their E5400 "HPC" series, so it's just LSI "dumb" block storage. NetApp bought LSI's storage (Engenio) line recently. It's the same hardware as the IBM DS5000 series, the Dell MD3000, or whatever other LSI rebrands there are.

So the controllers are relatively dumb compared to OnTap, but there's also no provision for just acting as JBODs. So we use what we have. It is true that with this tank setup, if anything goes wrong with any of the block devices, the whole ZFS pool is toast.

I'm not suggesting you build a system just like this, our hardware is the result of lots of University bureaucracy and budget nightmares, but I think our group ended up only having to purchase the one whitebox.

Since we're just looking to make one big filer, there's no good way to make an HA system without using a clustered filesystem (and we don't want to do active/passive, or use a clustered filesystem). So we're basically replicating the Dell NSS, but with hardware RAID and ZFS. And for testing purposes, though I'm sure it'll end up as "production" for some folks.

Since it's got two SSDs in it, you might want to seriously consider slicing the second one up as L2ARC and/or ZIL, depending on whether you're thinking of doing db stuff on your frankensan. The L2ARC on SSD is going to be a good idea regardless, with that much total storage, and the ZIL will make a big difference if you have a ton of synchronous writes (usually read: database applications).
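Wiring that up is a one-liner per device; a sketch assuming the 4GB partition is /dev/sdb1 and the big one is /dev/sdb2:

Code:

zpool add tank log /dev/sdb1      # ZIL / SLOG on the small partition
zpool add tank cache /dev/sdb2    # L2ARC on the rest of the SSD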

It's my understanding that the ZIL contains the only copy of unwritten data that's being cached by it. As such, if you lose the devices backing the ZIL you lose any pending writes that haven't been committed to the pool, files get corrupted, and so on and so forth. Everything I've ever read says back the ZIL with a RAID set with redundancy of some sort (usually just a RAID1 mirror of SSDs).

Has that changed?

Because if not, I'd rather not have a ZIL than have the time bomb of a ZIL that could suddenly up and disappear and take anything it was caching along with it. Losing the L2ARC isn't nearly as big of a deal since it's the read cache and therefore anything it stores is already committed in the pool.

if you lose the devices backing the ZIL you lose any pending writes that haven't been committed to the pool ... Everything I've ever read says back the ZIL with a RAID set with redundancy of some sort (usually just a RAID1 mirror of SSDs).

Yes, good catch - and with 130TB or so of storage, that might be a BIG deal. Only one correction: usually just a RAIDZ mirror of SSDs.

Good point, I can rebuild with a ZIL mirror. What happens if the L2ARC disappears? I assume it'll still cause an outage, just not loss of data. Or can ZFS simply fail the read cache device and keep working? I guess I can try that, just yank out that disk.
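Looks like it may not even need a full rebuild - attaching a second device to the existing log device should turn it into a mirror (device names here are hypothetical, e.g. a spare partition on the other SSD):

Code:

zpool attach tank /dev/sdb1 /dev/sda4   # mirror the existing log device
zpool status tank                       # the log should now show up as a mirror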

Not sure if losing L2ARC would cause a service interruption or not... Find out for us? :-)

The "surprise removal" of L2ARC is harmless. You will just see a sudden decrease in performance, at least under FreeBSD 9.1. If you have hardware problems, though, and the underlying device driver crashes, you might end with a hanging system. In that case redundancy won't help, though.Edit: I actually tested L2ARC removal on my ZFS box, but not ZIL removal.

The ZIL is actually less critical than commonly thought. The problem of not being able to import a pool with a failed log device should be a thing of the past now. And afaik ZFS keeps the data it writes out to the ZIL in RAM until it can flush it to disk, so the ZIL is only a RAM backup that gets read only in the event of a power failure with unflushed writes. If the ZIL SSD dies, the usual failure mode is a big performance drop and nothing else; only if it comes up dead after a power failure are you in real trouble. If there are two SSDs in the box, though, I would definitely create a mirrored ZIL.

The ZIL is only used for synchronous writes, and in normal operation it is never read from. You can use single disks or mirrors, but not raidz vdevs. Older versions of ZFS were unable to tolerate the loss of their ZIL device, but newer versions are capable of surviving the loss of the ZIL (and an unplanned shutdown) with only data loss (rather than pool corruption). You can also cleanly detach the ZIL from the pool while it is running, with no data loss, if you need to.
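Detaching it is just (pool and device names hypothetical):

Code:

zpool remove tank /dev/sdb1   # cleanly remove the log device from a running pool
zpool status tank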

The L2ARC is used mainly for random reads; streaming reads generally skip it and are read directly from the underlying storage. It is made up of single disks; mirroring and raidz are not allowed. You can lose these at any time without any issues.

Asynchronous writes to ZFS are cached in memory and flushed on a timer. The application is told that the data is "written" as soon as it is in the memory queue.

Synchronous writes to ZFS are cached in memory and written to the ZIL. The application is told that the data is written as soon as it is safely stored on the ZIL. The writes are then read from memory and flushed to the actual pool on a timer. The ZIL is only read from if the system crashes between informing the application that the data is on disk and actually flushing it to its proper location in the pool.
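An easy way to see the two paths is to force synchronous writes with dd (the file path and sizes are just an example):

Code:

# async: dd gets its ack as soon as the data is queued in memory
dd if=/dev/zero of=/tank/testfile bs=4k count=100000
# sync: every write has to hit the ZIL before dd gets the ack
dd if=/dev/zero of=/tank/testfile bs=4k count=100000 oflag=dsync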

What's maybe noteworthy to make clear* is that the ZIL's hit by streaming writes, whereas the L2ARC is bypassed for streaming reads. The ZIL's bypassed for asynchronous writes, though.

There also is a ZFS tuneable to make the L2ARC service streaming reads. I enabled that on my NAS box as my Windows iSCSI disk's access pattern is mainly streaming reads from ZFS' perspective. (Less important, there's also a tuneable for the population speed of the L2ARC that's rather low for home and small scale setups. That's mainly important for the time when the cache's not hot yet, which should be extremely rare for a ZFS box, though.)
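On ZoL those knobs are module parameters; roughly this in /etc/modprobe.d/zfs.conf (the values are just a starting point, not gospel):

Code:

# l2arc_noprefetch=0 lets the L2ARC cache streaming/prefetch reads;
# l2arc_write_max raises the fill rate (bytes/sec) above the conservative default
options zfs l2arc_noprefetch=0 l2arc_write_max=67108864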

There is, indirectly, also a tuneable to force everything through the ZIL if present. If you set the sync property of a FS to always, all data's going to be piped through the ZIL. That rarely makes sense to use, though.
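i.e. something like (dataset name is just an example):

Code:

zfs set sync=always tank/somedataset   # push all writes through the ZIL
zfs get sync tank/somedataset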

*As in emphasizing and comparing directly, you pretty much said all that already.

So I started up some NFS client I/O and some local I/O, and then went and physically yanked out /dev/sdb. I/O paused for about 30s, then resumed.

Code:

root@server:~# zpool status
  pool: tank
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: none requested
config:

Had to copy my RaidZ mirror over to another disk because of the ZFS version differences and create a new pool to copy it all back.

This SHOULDN'T be a true statement AFAIK. zpool upgrade generally cures all ills in-place. That said, I've only migrated from FreeBSD->Ubuntu and from older versions of ZFS on Ubuntu to newer versions - if there's a feature in your version of Solaris that has no equivalent in ZoL, you could have issues. Kinda doubting that right now though.

Solaris 11 is post-closed-sourcing of the Solaris code, and it runs a ZFS version that is newer than (and unsupported by) any of the open source projects that use ZFS. The open source ones are also using a (different) newer version of ZFS, and you can't move from them to Solaris 11 either. Pool version 28 is the newest fully interoperable version.

Ah, thank you. I don't so much keep up with the Solaris end of things, and especially not after the Evil Empire - er, our good friends at Oracle - bought it out and did their... inevitable... thing with it.

What really makes things more confusing for me is that there ARE no version numbers anymore - just "features" which are or are not enabled, and do or do not make underlying changes to the metadata (if they do, then the migration target has to support that feature; if they don't, then the target need not support it). Confusing.
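At least you can see where you stand with something like (pool name hypothetical):

Code:

zpool upgrade -v                      # list the features this ZFS build supports
zpool get all tank | grep feature@    # see which features are enabled/active on the pool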

@Bastard - thinking about doing the same (except with Debian); I have a RaidZ mirror too - was the copy absolutely necessary? Noticed any difference in pool performance?

Yes, the Solaris 11 version is not compatible with Linux. Another reason was that I went from a mirror to a RaidZ.

I only had three disks and they all needed to be in the RaidZ but I didn't have any spare disk to copy my data to.

I found out you can use a file as a ZFS device and created a RaidZ of 2 disks and 1 file. Copied the data over and after that replaced the file in the RaidZ with the physical disk. Couple of hours of resilvering later everything was running smoothly.
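Roughly what that looked like on the Linux side (sizes and device names from memory, so treat it as a sketch):

Code:

truncate -s 3T /tmp/fakedisk.img                # sparse file standing in for the missing disk
zpool create tank raidz /dev/sdb /dev/sdc /tmp/fakedisk.img
zpool offline tank /tmp/fakedisk.img            # (optional) run degraded so the sparse file never fills up
# ... copy the data into the new pool ...
zpool replace tank /tmp/fakedisk.img /dev/sdd   # swap in the real disk and let it resilver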

No performance problems but my overall impression is that Solaris is faster.

Probably what I will do is first make backups (I already have an off-line backup), and then install Debian Linux with / on ext4 and an ext3 /boot, using mdraid RAID1 on a separate pair of drives. Once Debian is up and running, then install ZFSonLinux and try to import the FreeBSD pool.
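If the import goes like these usually do, it should just be (pool name is hypothetical):

Code:

zpool import -d /dev/disk/by-id          # scan for importable pools
zpool import -d /dev/disk/by-id -f tank  # import the FreeBSD-created pool by name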