We do the same thing as you: RAID 5 with Shadow Copy. However, we also have two off-site USB hard drives that we back up to with Robocopy every night (we rotate the drives twice a week so one is always off-site). This gives us backups for disaster recovery as well, but not long-term archives, which our small organization doesn't really need. You should at least keep an off-site copy of the data on your server, because if your RAID array dies, you'll lose your snapshots too.
– Austin ''Danger'' Powers, Mar 9 '14 at 17:12

If you want to find out whether it's possible for a RAID array to fail as a whole, hit one with a sledgehammer and try recovering your data. There's a whole class of bad stuff that can take out an entire box without taking out the entire site. That said, if your on-site backups are just a convenience that might save you recovering more slowly from off-site backups, then in principle they can be as bad as you like.
– Steve Jessop, Mar 10 '14 at 0:08

Yes, we already have off-site backups and a more "traditional" on-site solution. The reason I asked this question is that I read about the features of btrfs and ZFS, and was wondering whether they would be suitable as a replacement for the on-site backups.
– 小太郎, Mar 10 '14 at 0:26

8 Answers

What happens when your filesystem or RAID volume gets corrupted? Or your server gets set on fire? Or someone accidentally formats the wrong array?

You lose all your data and the not-real-backups you thought you had. That's why real backups are on a completely different system than the data you're backing up - because backups protect against something happening to the system in question that would cause data loss. Keep your backups on the same system as you're backing up, and data loss on that system can impact your "backups" as well.

How about this solution, since I run into it often? Are local snapshots + remote snapshots to another server (on-site or off-site) + RAID on both systems a replacement for traditional backups?
– ewwhite, Mar 9 '14 at 14:23


@ewwhite Assuming they're restore-tested, and a complete copy of your data exists on a remote system, sure. Then it's basically a disk-to-disk backup... and I do love disk-to-disk backups.
– HopelessN00b♦, Mar 9 '14 at 17:35

For on-site backup, snapshots might be good enough, provided that you regularly 'export' them somewhere else, where they exist as passive data.

And regularly test whether your 'shipped snapshot' can be restored.

This is how I implemented a quick backup of some of my servers: store the data on ZFS, take a ZFS snapshot, send the delta to another server, where the whole filesystem is re-created (minus the actual service running).
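The snapshot-and-ship step can be sketched roughly as follows. This is only an illustration, not the exact setup described above: the dataset names, snapshot names and the host `backup01` are made up, and the helper (a hypothetical `replication_commands`) merely assembles the `zfs snapshot`, `zfs send -i` and `zfs receive` command lines without running them.

```python
import shlex


def replication_commands(dataset, prev_snap, new_snap, remote_host, remote_dataset):
    """Return the shell commands for one incremental snapshot shipment.

    Nothing is executed here; the commands are only assembled so they can be
    reviewed or handed to a scheduler. All names passed in are placeholders.
    """
    new_full = f"{dataset}@{new_snap}"
    prev_full = f"{dataset}@{prev_snap}"
    snapshot_cmd = f"zfs snapshot {shlex.quote(new_full)}"
    # -i sends only the changes between the previous and the new snapshot.
    send_cmd = f"zfs send -i {shlex.quote(prev_full)} {shlex.quote(new_full)}"
    # The receiving server rebuilds the filesystem from the stream.
    receive_cmd = (
        f"ssh {shlex.quote(remote_host)} zfs receive {shlex.quote(remote_dataset)}"
    )
    return [snapshot_cmd, f"{send_cmd} | {receive_cmd}"]


if __name__ == "__main__":
    for cmd in replication_commands(
        "tank/data", "2014-03-08", "2014-03-09", "backup01", "backup/data"
    ):
        print(cmd)
```

The very first shipment would need a full `zfs send` without `-i`, since there is no earlier snapshot on the receiving side to apply a delta to.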

Of course, the best backup is always off-site. Thus, after 'shipping' the snapshot(s) to a separate system, do a 'tape-out' of the snapshots regularly.

So, in my system, the server that receives the snapshot deltas regularly dumps all its ZFS pools (including earlier snapshots) to tape.

And of course, test your tape-outs to ensure they can be restored.
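One way to make "test your restores" concrete is to restore into a scratch directory and compare checksums against the data the backup was taken from. The sketch below assumes both trees are mounted as ordinary directories; the function names are hypothetical, and it says nothing about how the restore itself is performed.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            h = hashlib.sha256()
            with path.open("rb") as fh:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    h.update(chunk)
            digests[path.relative_to(root).as_posix()] = h.hexdigest()
    return digests


def restore_differences(source_root, restored_root):
    """Return sorted relative paths that are missing, extra, or differ."""
    src = tree_digests(source_root)
    dst = tree_digests(restored_root)
    return sorted(p for p in src.keys() | dst.keys() if src.get(p) != dst.get(p))
```

An empty result means every file matched. Compare against the snapshot the backup was made from rather than the live filesystem, otherwise files changed since the backup will show up as false alarms.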

Note: You will want the snapshot to take place while disk activity is quiesced, and preferably in coordination with the database (if any) to ensure consistency; otherwise, the cure might be worse than the illness. That's why the 'live snapshot' feature from NetApp and EMC is very useful: it postpones a LUN's snapshot until the database using the LUN indicates that it's safe to carry out the snapshot.

Can you elaborate on how you dump your ZFS snapshots to tape?
– ewwhite, Mar 9 '14 at 15:53

@ewwhite You can always back up the .zfs/snapshot directory, or mount one of the snapshots somewhere else to do a tape-out. So it's a separate backup for each snapshot.
– pepoluan, Mar 9 '14 at 19:32

I'm doing this with zvols, actually... so I don't have a .zfs directory to cd into.
– ewwhite, Mar 9 '14 at 19:36

@ewwhite Ahh, I see... in that case, you might be able to use zfs send $SNAPSHOT_NAME > $YOUR_TAPE_DEVICE, and later do a zfs receive $RESTORE_NAME < $YOUR_TAPE_DEVICE. I honestly have no experience with backing up zvols, though...
– pepoluan, Mar 9 '14 at 19:49

Proper backups are on a separate device from the device being backed up. What happens when you lose two or more drives? What happens when your server room burns down? What happens when someone accidentally destroys your array?

(Anecdote alert: I once heard of someone who had PXE set to auto-install the latest Fedora. His UPS failed. After a power outage, his server rebooted and was set to PXE boot and... installed Fedora over his data. My point? Freakish things happen. Fortunately, he had proper backups.)

Preferably, you have at least three copies of your data, one stored completely offsite in case the data center burns down.

Yes, it is. It is a perfect way to store backups. Nothing else is needed; heck, even doing integrity checks is just wasted time.

Just to confirm - before I give more advice... you work for a competitor of mine, right? You really do, sure? No? Oh.

Sorry, NUTS. No, not at all. Sorry, dude.

The problem is that you are totally exposed to any error that happens at (a) the system level and (b) the operating system level. You basically only protect against someone deleting some data. Nice. That IS a frequently occurring error.

What you are not protecting from is:

A power spike wiping out the machine. Been there, seen that.

Some defective RAID controller or memory writing sh** on the disk - there goes everything.

And a long list of other things.

This is why - naturally, unless you work for a competitor of mine - you should always make a backup:

On another computer

That you isolate from at least power spikes (even if you have a UPS).

This is why tapes rock - they are not connected, and anything short of a fire or flood will not hurt them. Power spike - there goes the tape reader and maybe the robot, but the tapes not in the reader are not going to be affected.

BEST would be backups offsite (did I mention stuff like fire and flooding already?) (Again, when you work for a competitor - there is no such thing as a building fire, it is totally not needed, as is fire insurance, please, save that money).

Now, you may think "oh, flooding never happens". Make sure you are sure. See, here is a video of the 09.09.09 flooding of a Vodafone datacenter. I am sure you will understand where the issue is for an on-site / in-computer backup.

Properly implemented snapshots MUST be supported by your storage, as decent backup software uses them as the very first stage of a backup job. It is, however, a bad idea to use snapshots as your primary backup. Reasons:

1) Snapshots and backend storage CAN fail. So real backups must use a separate spindle set, or there's a great chance of losing both the primary working set and the backup data at the same time.

2) Snapshots "chew away" usable space. It makes sense to use expensive and fast storage for current hot data and to off-load snapshots and backups, which are ice-cold data, to some cheaper and slower storage. It works very well with 1), BTW.

3) Snapshots usually slow down the whole process. Most systems use Copy-on-Write, and this approach creates fragmentation. Redirect-on-Write is faster but eats A LOT of space. Very few vendors have properly implemented snapshots: NetApp with WAFL and Nimble Storage with CASL (I'm not affiliated with either of them). Pretty much everybody else has issues. For example, Dell EqualLogic triggers a 15 MB page update (and waste) on every single byte changed. That's EXPENSIVE.

Lesson learned from two RAID-1 drives failing within half an hour of each other: RAID is not a backup mechanism, not in any way, shape or form.

RAID is an availability mechanism that reduces downtime in case of hardware failure, but it won't help you at all in case of e.g. viruses, data deletion/modification or plain catastrophic hardware failure.

Many experienced administrators go with what is known as the 3-2-1 rule of backups:

You should have at least three copies of your data, including the primary source. I.e. a single backup is not enough and copies within the same physical system do not count.

You should be using at least two different backup methods.

You should have at least one off-site copy of your data.

Snapshots violate all three parts:

You only use a single physical machine. Anything affecting the whole machine, such as a PSU failure, could take with it all your data.

You are only using a single method for your backups. If anything is wrong with it, you will only find out when restoring the backup in a crisis situation.

You have no backups off-site. Floods and fires happen only to others, until they happen to you...

Therefore:

You need to have at least one backup on a separate machine on your LAN.

You need to have at least one backup that is not generated using snapshots. Perhaps a good-old incremental tar archive might be in order? Or an rsync based copy?

You need to have at least one remote backup, as far as possible from your current location and definitely not in the same building.
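The three rules can be turned into a quick sanity check over an inventory of copies. This is a minimal sketch under one reading of the rule: copies on the same physical machine count once, and the primary is not counted as a backup method. The `Copy` record and `check_3_2_1` function are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    machine: str  # physical system holding the copy
    method: str   # how it was produced, e.g. "primary", "snapshot", "rsync"
    site: str     # physical location


def check_3_2_1(copies, primary_site):
    """Return a list of the 3-2-1 rules that the given copies fail to meet."""
    problems = []
    # Copies within the same physical system do not count separately.
    if len({c.machine for c in copies}) < 3:
        problems.append("fewer than three copies on separate machines")
    # The primary data itself is not a backup method.
    methods = {c.method for c in copies if c.method != "primary"}
    if len(methods) < 2:
        problems.append("fewer than two backup methods")
    if not any(c.site != primary_site for c in copies):
        problems.append("no off-site copy")
    return problems
```

Fed a server that only keeps local snapshots, for example `[Copy("srv1", "primary", "office"), Copy("srv1", "snapshot", "office")]`, the check reports all three problems, which matches the argument above.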

It should also be pointed out that block-level snapshots have about the same consistency guarantees as pulling the plug on your machine and then copying over the disks. In general, you would need to run fsck after a restore or hope that the journal is enough.

Filesystem-level snapshots should be better, but they still would not guarantee the consistency of your files. For many applications (database servers come to mind) copying the files of a live instance can be completely useless, since they could be in an inconsistent state. You would need to use their own application-level backup mechanism to ensure the existence of a clean copy - for which the 3-2-1 rule would apply as well.
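As an illustration of an application-level backup mechanism, here is a small sketch using SQLite as a stand-in for a database server (an assumption made for the example; the thread does not say which database is in use). Python's `sqlite3.Connection.backup` asks the database engine itself for a consistent copy instead of copying the live file from disk.

```python
import sqlite3


def backup_database(live_path, backup_path):
    """Copy a live SQLite database using SQLite's own online backup API."""
    source = sqlite3.connect(live_path)
    target = sqlite3.connect(backup_path)
    try:
        # The engine produces a consistent copy even if other connections
        # write to the source database while the copy is running.
        source.backup(target)
    finally:
        target.close()
        source.close()
```

Other database servers ship their own dump or hot-backup tools that play the same role; whatever file such a tool produces is what should then be copied to the other machines and sites.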

Finally, keep in mind that right now we are only talking about copies of your current data. To guard against failures (or security breaches, for that matter) that go on undetected for some time you also need to have several past copies of your data for quite some time back.
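Keeping "several past copies" usually comes down to a retention schedule. The sketch below shows one possible policy, keeping every copy from the last few days plus the newest copy of each earlier week; the numbers and the function name `snapshots_to_keep` are arbitrary assumptions for illustration, not a recommendation from this answer.

```python
from datetime import date


def snapshots_to_keep(dates, today, daily=7, weekly=4):
    """Pick which dated copies to retain: recent dailies plus older weeklies."""
    keep = set()
    newest_per_week = {}
    for d in sorted(dates):
        age = (today - d).days
        if 0 <= age < daily:
            keep.add(d)
        elif daily <= age < daily + 7 * weekly:
            # Key on (ISO year, ISO week). Dates are visited oldest first, so
            # the newest copy in each week overwrites the earlier ones.
            newest_per_week[tuple(d.isocalendar())[:2]] = d
    keep.update(newest_per_week.values())
    return sorted(keep)
```

Anything not returned is a candidate for deletion. How far back to go depends on how long a failure or breach could plausibly go unnoticed.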

Assuming btrfs snapshots are anything like ZFS snapshots in terms of consistency guarantees (and with how much inspiration btrfs draws from ZFS, I don't see why that would not be the case), the snapshot will represent the on-disk moment-in-time data. So the file system will be in a consistent state if you roll back to a snapshot, but if data is kept in RAM and only flushed periodically, and that data is needed to make sense of what's on disk (cf. database server software), then those particular files will very likely be in an inconsistent state after (or before!) the rollback.
– Michael Kjörling, Mar 11 '14 at 12:14