ZFS auto-snapshots: a new home for 0.12

A quick post to mention that the old (non-Python-based) zfs-auto-snapshot SMF service needed a new home: since mediacast.sun.com went away, and hg.opensolaris.org is no more, there's not really anywhere for this service to live.

While this code was never intended to be any sort of enterprise-level backup facility, I still use this on my own systems at home, and it continues to work away happily.


17 thoughts on “ZFS auto-snapshots: a new home for 0.12”

Great! I will try it later. At the moment I use rsync for my backups; with zfs send/recv it will hopefully be much faster. That is what you use inside, yes? I'm also interested in your old non-Python-based service. Could you make it available too? Just for study purposes :-)

This is the non-Python-based version – it's just a simple ksh93 script that runs via a cron job. zfs send/recv can be faster than rsync, but it does depend on the sort of data you're sending/recving.
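
For illustration, the incremental send/recv pattern looks roughly like this (just a sketch – the pool, dataset and host names are made up):

    # initial full send of a snapshot to a pool on another machine
    zfs send tank/home@monday | ssh backuphost zfs recv backup/home

    # later, send only the blocks that changed between two snapshots
    zfs send -i tank/home@monday tank/home@tuesday | \
        ssh backuphost zfs recv backup/home

An incremental send only walks the blocks that changed between the two snapshots, rather than scanning the whole tree the way rsync does, which is where the speed-up tends to come from.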

As for my own personal backups, I tend not to use this service for the off-site backups themselves (I perform those much less frequently, and they're slightly complicated by my use of encrypted ZFS datasets), but I do find it invaluable for taking care of automatic snapshots. Strictly speaking those aren't backups, since they reside on the same physical box, but for me they perform a similar role: when I've deleted something I didn't mean to delete, they let me get it back.

Sorry for the mess if this fix brought some grief to some installations… In our case (supporting older boxes with Solaris 10u8 and dtksh, as well as some older SXCE versions with dtksh), this fix seemed to be what was needed to achieve proper trimming of excluded child datasets.
I wonder if there is some difference in pattern interpretation between the shells, and whether a fix could cater for both variants? Perhaps match the leading double slash optionally (with a question mark, as it would be in a regex)?
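
Something like this, perhaps – just a sketch, and the variable and dataset names are made up:

    # accept a dataset name with or without the leading '//', using
    # ksh's ?(...) optional sub-pattern (the analogue of a regex '?')
    dataset=rpool/export
    if [[ $fs == ?(//)${dataset}?(/*) ]] ; then
        print "matched: $fs"
    fi
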
//Jim

Tim,
Thank you so much for making this available – such a shame that so much good software and invaluable information was lost with the domain shutdowns.

I'm currently trying to get this working on Solaris 11.1 SRU 4.6, and the services keep failing with "snapshot already exists", "no snapshots were created" and "cannot create snapshot" being logged; the services then end up in maintenance.
I also see "Unable to take recursive snapshots" errors, and I found your postings from 2009/06/24, but the bug references are all on the lost domains, so I have no idea what they were or how I might work around them. :^(

I've verified that zfs/fs-name is "//" (stock/default).
I noticed that "com.sun:auto-snapshot" is set to "true" by default, but none of the instance-specific properties were there, so I added them ("com.sun:auto-snapshot:frequent" is the main one I'm testing with, but I added them all).
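
In other words, roughly this (the FMRI is as the instance is named on my box; dataset names abbreviated):

    # confirm the stock fs-name setting on the service instance
    svcprop -p zfs/fs-name svc:/system/filesystem/zfs/auto-snapshot:frequent

    # opt a filesystem in to the frequent schedule
    zfs set com.sun:auto-snapshot:frequent=true rpool/export/home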

Any ideas?

P.S. The reason I'm using this instead of "time-slider" is that I don't want a full desktop installation on my servers.

I just wanted to say that the property "com.sun:auto-snapshot" was the cause of the problems I was experiencing. Once I set it to false for all filesystems, the failures stopped, and snapshots are now properly taken only for the filesystems with the service-specific properties assigned.

Note that before I changed the ambiguous property to false, it was taking snapshots of all filesystems, including alternate boot environments, etc.

So, step 0 (on Solaris 11.1) should be to set com.sun:auto-snapshot=false, before enabling the services.
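
In commands, something along these lines (pool and dataset names are just examples):

    # step 0: switch the ambiguous inherited property off pool-wide
    zfs set com.sun:auto-snapshot=false rpool

    # then opt individual filesystems in, per schedule
    zfs set com.sun:auto-snapshot:frequent=true rpool/export/home

    # and only then enable the service instance
    svcadm enable auto-snapshot:frequent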

Also, in case anyone is curious, this works perfectly fine in a Solaris 11.1 zone.

It's interesting that there seems to be a negative interaction between the com.sun:auto-snapshot and com.sun:auto-snapshot:label properties – they're supposed to work in conjunction with each other, including and excluding child datasets as appropriate.

The code to figure that out, and to minimise the number of 'zfs snapshot' processes that get invoked, is a bit hairy though, and it's always been something I've wanted to have another crack at. Some year I hope to rewrite this stuff, this time in Python, but otherwise keeping exactly the same functionality…

My apologies, but I misinterpreted the previously described bug and fix.

The fix was coincidental: the bug is actually triggered by hierarchically nested ZFS filesystems with the same snapshot schedule enabled. It can also be triggered via the instance-specific properties.

For example: if I have rpool/export/home/myhome enabled for the frequent schedule and I "zfs create rpool/export/home/myhome/mysubdir", the next run of the "frequent" instance will throw the service into maintenance, unless I manually disable the inherited property that enables the snapshots. Unfortunately, that means the children cannot be snapshotted (zfs/fs-name='//', so zfs/snapshot-children is ignored).

Also, whether or not snapshot-children is enabled, "Taking recursive snapshots of" is reported in the dmesg output (I have verbose set to true), as is "Error: Unable to take recursive snapshots of" (referencing the snapshot-enabled child filesystem).

So zfs/snapshot-children is always ignored if ‘//’ is the value of zfs/fs-name.

The intent is that when those service properties are set, the com.sun:auto-snapshot ZFS user properties on the datasets are the only things that determine which datasets are included in the snapshot schedule. If child datasets inherit that user property, we try to optimise the taking of snapshots by snapshotting only the parent dataset, passing the '-r' flag to zfs so that it creates snapshots of all the children as well.
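
As a sketch of that intent (this is not the actual service code, which handles rather more cases):

    # if no descendant of $parent has opted out of the schedule, one
    # recursive snapshot covers the whole subtree in a single process
    if zfs get -H -r -o value com.sun:auto-snapshot "$parent" | \
            grep false > /dev/null ; then
        zfs snapshot "${parent}@${snapname}"     # children handled individually
    else
        zfs snapshot -r "${parent}@${snapname}"  # one process for the subtree
    fi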

That said, I haven’t had time to revisit this code in a long time. One of these days, it’ll annoy me enough that I’ll just start on the rewrite!

If I create rpool/export/home/kevin and have it enabled for snapshotting, it works great.
If I create (under that) rpool/export/home/kevin/tmp (without disabling it), it breaks.
If I clear rpool/export/home/kevin/tmp from snapshotting (so only the parent is snapshotted), it works.
If I create rpool/export/home/kevin/tmp/test, disable it, and enable its parent (so kevin and kevin/tmp are being snapshotted, but kevin/tmp/test is not), it works.
If all 3 levels are re-enabled, it breaks again.

The problem appears to be that it goes through the list of filesystems (kevin, kevin/tmp, etc.) and performs a recursive snapshot at each level. Once it gets to the second level, the snapshot with that name already exists, and it bombs out.
If the bottom level is disabled from snapshotting, this blocks the recursion code from being run and it works fine.
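
You can reproduce the collision by hand with nothing but zfs itself (the snapshot name here is made up – the service uses its own naming scheme):

    # the first recursive snapshot also creates the child's snapshot...
    zfs snapshot -r rpool/export/home/kevin@snap1

    # ...so recursing again from the child level fails:
    zfs snapshot -r rpool/export/home/kevin/tmp@snap1
    # -> cannot create snapshot: snapshot already exists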