I am considering a ZFS/iSCSI-based architecture for an HA/scale-out/shared-nothing database platform built from wimpy nodes of plain PC hardware running FreeBSD 9.

Will it work? What are possible drawbacks?

Architecture

Storage nodes have direct-attached cheap SATA/SAS drives. Each disk is exported as a separate iSCSI LUN. Note that no RAID (neither hardware nor software), partitioning, volume management or anything like that is involved at this layer: just one LUN per physical disk.
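For illustration, on FreeBSD 9 this could be done with the userland istgt target from ports. A rough sketch of exporting one raw disk as one LUN might look like the fragment below; the target name, portal/initiator groups and the /dev/da1 device are made-up placeholders, and the portal/initiator group definitions are assumed to exist elsewhere in the file:

    # /usr/local/etc/istgt/istgt.conf (fragment) -- hypothetical names
    [LogicalUnit1]
      TargetName  db-disk1                      # exported target for this one disk
      Mapping     PortalGroup1 InitiatorGroup1  # groups defined elsewhere in the config
      UnitType    Disk
      LUN0        Storage /dev/da1 Auto         # whole physical disk, no partitioning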

Database nodes run ZFS. A ZFS mirrored vdev is created from iSCSI LUNs coming from 3 different storage nodes. A ZFS pool is created on top of the vdev, and within that a filesystem, which in turn backs a database.
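On the database node this would boil down to something like the following, assuming the iSCSI initiator has attached the three LUNs as da1, da2 and da3 (device, pool and filesystem names are made up):

    # one mirrored vdev across LUNs from three different storage nodes
    zpool create dbpool mirror da1 da2 da3
    # filesystem that backs the database
    zfs create dbpool/db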

When a disk or a storage node fails, the respective ZFS vdev will continue to operate in degraded mode (but still have 2 mirrored disks). A different (new) disk is assigned to the vdev to replace the failed disk or storage node. ZFS resilvering takes place. A failed storage node or disk is always completely recycled should it become available again.
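Replacing the failed LUN is then an ordinary ZFS operation; assuming da2 is the dead LUN and da4 is a freshly attached LUN from another storage node, it would look roughly like:

    zpool replace dbpool da2 da4   # kicks off resilvering onto the new LUN
    zpool status dbpool            # shows resilver progress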

When a database node fails, the LUNs previously used by that node are free. A new database node is booted, which recreates the ZFS vdev/pool from the LUNs the failed database node left behind, as sketched below. There is no need for database level replication for high-availability reasons.
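On FreeBSD 9 the take-over on a fresh database node would amount to re-establishing the iSCSI sessions with iscontrol(8) and then importing the pool (the nickname, IQN and address below are made up; one such entry would be needed per LUN):

    # /etc/iscsi.conf (fragment) -- hypothetical values
    storage1-disk1 {
            targetaddress = 10.0.0.11
            targetname    = iqn.2012-08.com.example:db-disk1
    }

    # bring up the session for each LUN; once all sessions are up,
    # the pool left behind by the dead node can be imported (see the answer below)
    iscontrol -c /etc/iscsi.conf -n storage1-disk1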

Possible Issues

How to detect the degradation of the vdev? Check every 5s? Any notification mechanism available with ZFS?

Is it even possible to recreate a new pool from existing LUNs making up a vdev? Any traps?

Why are you considering such a beast? The best description I can think of for what you're describing is that it's "Technically possible, but unwise."
– voretaq7♦ Aug 21 '12 at 15:19

The main reason: with a "standard" architecture where ZFS RAID would be used over local physical disks and only that exported as iSCSI LUNs, there needs to be additional redundancy to protect against failures of complete storage nodes; in other words, in our case, database replication. And that is also complex. Why is it unwise?
– oberstet Aug 21 '12 at 15:30

Aside from being a pretty non-standard way to set things up (I've literally never seen a configuration like this, even in labs -- if you do it document and test the hell out of it!) you're introducing a bunch of network links into the equation, with associated latency and failure chances. Since you already have N storage nodes with local disk why not consider something like HAST & traditional failover tools for the node-failure protection?
– voretaq7♦ Aug 21 '12 at 15:51

Thanks for pointing to HAST! Using HAST would be quite similar: redundancy above the storage nodes holding the physical disks and below the database replication level. But HAST only supports 2 nodes, and only 1 can be active. A ZFS mirror also supports 3 storage nodes and can use all of them for read ops, increasing throughput. HAST uses CARP .. additional complexity. As to the architecture being unusual: you are probably right, and I am wondering why. Creating ZFS vdevs over SAN LUNs even seems to be recommended: hub.opensolaris.org/bin/view/Community+Group+zfs/…
– oberstet Aug 21 '12 at 16:09

CARP really isn't all that complex (see my answer & the linked docs) - there really isn't anything fundamentally wrong with what you're proposing, but I don't think it's been done (or at least not documented), so the edge conditions are all unknowns and would need testing. If you do implement this a blog post or an answer here detailing your setup & any oddities you encounter in testing would be much appreciated - I can see some benefits to the design you're proposing. You can also hop on to SF Chat to talk through the design further.
– voretaq7♦ Aug 21 '12 at 16:22


The big caveat here is that HAST only works on a Master/Slave level, so you need pairs of machines for each LUN/set of LUNs you want to export.
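For comparison, a HAST resource is defined per device pair in /etc/hast.conf, roughly as sketched below (hostnames, addresses and device paths are placeholders):

    # /etc/hast.conf (fragment) -- hypothetical hosts/devices
    resource disk1 {
            on storage-a {
                    local /dev/ada1
                    remote 10.0.0.2
            }
            on storage-b {
                    local /dev/ada1
                    remote 10.0.0.1
            }
    }

The resulting /dev/hast/disk1 provider only exists on whichever node is currently primary, which is exactly the Master/Slave limitation described above.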

Another thing to be aware of is that your storage architecture won't be as flexible as it would be with the design you proposed:
With HAST you're limited to the number of disks you can put in a pair of machines.
With the iSCSI mesh-like structure you proposed, you can theoretically add more machines exporting more LUNs and grow as much as you'd like (up to the limits of your network).

That tradeoff in flexibility buys you a tested, proven, documented solution that any FreeBSD admin will understand out of the box (or be able to read the handbook and figure out) -- to me it's a worthwhile trade-off :-)

"zpool status -x" will output whether all pools are healthy or output the status of ones that are not. If a iSCSI LUN vdev goes offline a cron job running a script based around that command should give you a way to have cron alerts on a regular basis.

"zpool import" should be able to import the existing zpool from the iSCSI LUNs vdevs. You may have to force the import if the pool was not exported cleanly but internal metadata should keep the data in a consistent state even if writes were interrupted by the database node failing.

Periodically running zpool status, ok. But there is nothing "event driven"? Is there a ZFS DTrace provider that could emit events like a failed disk? Regarding zpool import: so all the info needed to recreate a pool is stored within the vdevs (and with a mirrored vdev, hence redundantly)? No info from the host (which may have died completely) is needed?
– oberstet Aug 21 '12 at 16:22

I am not aware of any event-driven way of noticing a disk going offline; maybe in your design the iSCSI initiator could proactively monitor when it is unable to communicate with a target, or there may be some other way to monitor the LUN availability directly.
– Thomas G Aug 21 '12 at 18:05

It's even more tricky. I am using a test setup of 2 storage VMs and 2 database VMs, all FreeBSD 9. I created a 2-way mirrored pool over LUNs from the 2 storage nodes. I could verify resilvering after a storage node was gone and reappeared. However: I needed to kill iscontrol on the database node for the storage node that had gone away. Without doing that, iscontrol silently tries to reconnect forever (it only starts doing so once the ZFS pool is accessed), and as long as it retries, any ZFS pool command just hangs. Need to do more experiments. Can I run iscontrol without it daemonizing itself?
– oberstet Aug 21 '12 at 19:40