This is my first post to this list. I have been running backups using rsnapshot for the last 6 years and encountered my first problem last week. When recovering files from a snapshot, we noticed a few cases of data corruption. The affected files were fairly old (several years), and we have not been able to locate the source of the error. It could possibly have happened during a raid disk failure a few years ago on the volume where the snapshots are stored.

The reason the corrupt files have been able to stay undetected all this time is the default rsync -a strategy of only comparing file modification times. Temporarily changing the rsync flags to -ac, meaning rsync will checksum file contents, turned up several other cases of corrupt files. Unfortunately, rsync -ac made the entire sync process take >5 hours instead of ~10 minutes.

I would suggest running a checksummed sync less often (such as once a month) to verify the integrity of the snapshots. It would be nice if rsnapshot could be extended to take a flag selecting an alternate rsync invocation, without having to resort to keeping two different configuration files up to date.
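For what it's worth, rsnapshot's include_conf directive makes it possible to keep the shared directives in one file and maintain two thin top-level configs that differ only in rsync_long_args. A sketch (the paths, retention values and backup line are made up, and rsnapshot requires TABs between fields):

```
# /etc/rsnapshot-common.conf -- everything shared between the two configs
snapshot_root	/snapshot/
interval	hourly	6
interval	daily	7
backup	root@host:/etc/	host/

# /etc/rsnapshot.conf -- normal, fast runs
config_version	1.2
include_conf	/etc/rsnapshot-common.conf
rsync_long_args	--delete --numeric-ids --relative --delete-excluded

# /etc/rsnapshot-checksum.conf -- slow integrity run, once a month
config_version	1.2
include_conf	/etc/rsnapshot-common.conf
rsync_long_args	--delete --numeric-ids --relative --delete-excluded --checksum
```

Cron would then run "rsnapshot hourly" etc. as usual, plus "rsnapshot -c /etc/rsnapshot-checksum.conf sync" once a month. The two top-level files still have to be kept in step, but only in the single rsync_long_args line.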

On another note: on a new setup, I recently tried the rather poorly documented "sync_first" option. The suggested way of using it seems to be to run, from crontab:

rsnapshot sync && rsnapshot hourly

However, this had the unfortunate consequence that the snapshot failed to rotate whenever any one (out of several dozen) of the clients failed to respond. It would be nice to be able to discriminate between two error conditions: an inconsistent snapshot, due to a synchronization failing in the middle of its operation, and a consistent but possibly outdated snapshot, where one or several backup locations could not be contacted. The snapshot rotation should be performed in the latter case, but not in the former.

This leads me to a more general problem of handling a multitude of clients under the same snapshot tree. If, for example, one client dies permanently in the middle of a sync operation, then its latest snapshot will be inconsistent(1). An inconsistent snapshot should never be rotated, but good snapshots should. Unfortunately, since all the different backup locations are stored under the same tree, it is hard to implement such behavior. The inconsistent snapshot of the dead client could prevent rotation indefinitely.

Best regards,

Oskar

1) I am not sure if rsync is able to guarantee that file updates are atomic, but even if they are, having an inconsistent set of files could be a problem.
_______________________________________________
rsnapshot-discuss mailing list
rsnapshot-discuss < at > lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rsnapshot-discuss

Mon Apr 18, 2011 2:15 am

Helmut Hullen

Regular integrity check

Hello, Oskar,

you wrote on 18.04.11:

This leads me to a more general problem of handling a multitude of
clients under the same snapshot tree.

Sorry - that's not a good idea.

If, for example, one client
dies permanently in the middle of a sync operation, then its latest
snapshot will be inconsistent(1). An inconsistent snapshot should
never be rotated, however, good snapshots should. Unfortunately,
since all different backup locations are stored under the same tree,
it is hard to execute one such behavior. The inconsistent snapshot of
the dead client could prevent a rotation indefinitely.

It is better to work with a separate *.conf, *.log, *.pid and root
directory per client. Then no job can disturb another.

On Mon, Apr 18, 2011 at 3:12 AM, Oskar Linde <ol < at > csc.kth.se> wrote:
This is my first post to this list. I have been running backups
using rsnapshot for the last 6 years and encountered my first
problem last week. When recovering files from a snapshot, we noticed
a few cases of data corruption. The affected files were fairly old
(several years), and we have not been able to locate the source of
the error. It could possibly have happened during a raid disk
failure a few years ago on the volume where the snapshots are
stored.

The reason the corrupt files have been able to stay undetected all
this time is the default rsync -a strategy of only comparing file
modification times. Temporarily changing the rsync flags to -ac,
meaning rsync will checksum file contents, turned up several other
cases of corrupt files. Unfortunately, rsync -ac made the entire sync
process take >5 hours instead of ~10 minutes.

I would suggest running a checksummed sync more seldom (such as once
a month) to verify the integrity of the snapshot. It would be nice
if rsnapshot could be extended to take a flag to use an alternate
rsync method without having to resort to keeping two different
configuration files up to date.

I am in much the same boat, with more than five years of backups, and
I have begun to worry about bit-rot. Unfortunately, you do not
require an explicit event such as raid failure to cause this to
happen. If you put data on a disk and do not read it for a long time,
you can expect to get a certain (very low) level of corruption. It
just happens, and, as you mention, unless the corruption happens in
the metadata, rsync isn't going to notice that the actual bits changed
versus the original file.

Unfortunately, simply doing a checksum sync only solves the problem
for files which continue to exist in the primary location. Also, it
does not deal with the case where the primary location is the one
where the bit-rot is happening!

I think the real solution is a checksumming/scrubbing system which
runs independently on the primary data and the backup data. The
checksumming part is that when new files arrive or files are changed,
they are checksummed, with the results stored in a central location.
Then, on a periodic basis, a fraction of files have their checksums
checked, arranged such that all files get their checksums checked on a
particular interval. Then when there's a mis-match, the operator can
swing into action and fix either the primary or the backup.

There is probably software out there to manage this, but I've always
wanted something which was rsnapshot-aware, so rather than being
pretty well protected now, it seems I'd rather have better protection
in the mythical future where I have time to write the script.
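A rough, non-rsnapshot-aware sketch of that scheme in shell (the database location is a placeholder, and GNU coreutils split and sha256sum are assumed): checksum everything once, slice the database into thirty pieces, and verify one piece per day so every file gets re-read roughly monthly.

```shell
#!/bin/sh
# Sketch of a checksum/scrub pass. $DB is a placeholder path.
DB=/var/lib/scrub

build_db() {    # run once, and again whenever files are added or changed
    tree=$1
    mkdir -p "$DB"
    find "$tree" -type f -exec sha256sum {} + | sort > "$DB/sums"
    # GNU split: 30 pieces of roughly equal line counts, suffixes 00..29
    split -n l/30 -d "$DB/sums" "$DB/part."
}

scrub_today() { # verify today's slice; prints one FAILED line per mismatch
    part=$(( $(date +%s) / 86400 % 30 ))
    sha256sum -c --quiet "$DB/part.$(printf '%02d' "$part")" 2>&1 || true
}
```

Cron would run scrub_today daily on both the primary and the backup tree and mail any output to the operator, who can then decide which side went bad.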

This leads me to a more general problem of handling a multitude of
clients under the same snapshot tree.

Sorry - that's not a good idea.

Thank you. You have me convinced.

If, for example, one client
dies permanently in the middle of a sync operation, then its latest
snapshot will be inconsistent(1). An inconsistent snapshot should
never be rotated, however, good snapshots should. Unfortunately,
since all different backup locations are stored under the same tree,
it is hard to execute one such behavior. The inconsistent snapshot of
the dead client could prevent a rotation indefinitely.

It is better to work with a separate *.conf, *.log, *.pid and root
directory per client. Then no job can disturb another.

That sounds like the best way to do it. Now, how would one manage conf-files and cron-entries for circa 50 different clients in a reliable way?

Unfortunately, simply doing a checksum sync only solves the problem
for files which continue to exist in the primary location. Also, it
does not deal with the case where the primary location is the one
where the bit-rot is happening!

A relatively easy way of doing this would be to periodically (e.g. once
a month):

1. Use rsnapshot in LVM-snapshot mode.
2. Run the normal rsync job, but don't discard the LVM snapshot on
completion.
3. Run the normal rsync job again, but with --checksum --dry-run (i.e.
only print files with different checksums, but don't do anything in case
the source has gone bad).
4. Complain loudly when step 3 gives any output (i.e. bit-rot, or
deliberate mischief by intruders etc.).
5. Discard the LVM snapshot.
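Outside of rsnapshot itself (which can mount LVM snapshots for you), the same five steps can be sketched in plain shell. The volume names, mount point and snapshot destination below are placeholders, and this needs root:

```sh
#!/bin/sh
# Monthly bit-rot check against a frozen source. All names are examples.
set -e
lvcreate -s -L 1G -n scrub /dev/vg0/data      # 1. snapshot the live volume
mount -o ro /dev/vg0/scrub /mnt/scrub
rsync -a --delete /mnt/scrub/ /snapshot/host/ # 2. the normal sync
# 3. dry-run with checksums: list differing files, change nothing
bad=$(rsync -naic /mnt/scrub/ /snapshot/host/)
if [ -n "$bad" ]; then                        # 4. complain loudly
    printf '%s\n' "$bad" | mail -s "bit-rot suspected" root
fi
umount /mnt/scrub                             # 5. discard the snapshot
lvremove -f /dev/vg0/scrub
```

Because the source is frozen for the whole run, any file step 3 reports really did change on one side or the other, rather than merely being modified mid-sync.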

I am in much the same boat, with more than five years of backups, and
I have begun to worry about bit-rot. Unfortunately, you do not
require an explicit event such as raid failure to cause this to
happen. If you put data on a disk and do not read it for a long time,
you can expect to get a certain (very low) level of corruption. It
just happens, and, as you mention, unless the corruption happens in
the metadata, rsync isn't going to notice that the actual bits changed
versus the original file.

Unfortunately, simply doing a checksum sync only solves the problem
for files which continue to exist in the primary location. Also, it
does not deal with the case where the primary location is the one
where the bit-rot is happening!

In both cases, consider tripwire <http://www.tripwire.org/> or similar.

Its *intended* use is to spot unauthorised changes to binaries, but you
could also use it for verifying your backups. Run it daily or whatever
on the machines you're backing up, and make sure that the database of
checksums that it builds gets backed up. Then also run tripwire on your
backups.

Things to beware of:
* you don't care when it tells you that the number of links to a file
is wrong;
* you don't care if the inode number is wrong;
* you don't care if the owner/group are wrong (although you do care
about the uid/gid - but they might map to different user/group names
on your backup machine);
* be careful that you don't run rsnapshot weekly or anything else that
might rotate backups at the same time as you're running these
checks, for obvious reasons! So you might want to run it in a
wrapper that touches the rsnapshot lock file;
* checksumming a large directory tree takes a *very* long time.

I don't do this myself, but I probably should. It will catch media
errors as well as plain old bit rot.

--
David Cantrell | Godless Liberal Elitist

All children should be aptitude-tested at an early age and,
if their main or only aptitude is for marketing, drowned.

I ended up with a wrapper script which dynamically constructs a conf-file for each location with a proper lockfile, logfile and a default snapshot_root. Having many backup locations, this prevents me from messing up any crontab entry or local conf-file.

It is certainly a hack, but allows me to add new backup locations by just creating a one line file:

/etc/rsnapshots/location
backup <source> ./

that will automatically place the snapshot at /snapshot/location/, and log to /var/log/rsnapshot/location.log
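A minimal sketch of such a generator (the retention settings and path layout are assumptions, and rsnapshot wants TAB-separated fields):

```shell
#!/bin/sh
# Print a complete rsnapshot config for one backup location to stdout.
# $1 = location name, $2 = the one-line file holding its backup directive.
gen_rsnapshot_conf() {
    loc=$1
    locfile=$2
    printf 'config_version\t1.2\n'
    printf 'snapshot_root\t/snapshot/%s/\n' "$loc"
    printf 'lockfile\t/var/run/rsnapshot-%s.pid\n' "$loc"
    printf 'logfile\t/var/log/rsnapshot/%s.log\n' "$loc"
    printf 'interval\thourly\t6\n'
    printf 'interval\tdaily\t7\n'
    cat "$locfile"    # the one-line "backup <source> ./" file
}

# Cron would then loop over /etc/rsnapshots/* and run, per location:
#   gen_rsnapshot_conf "$loc" "/etc/rsnapshots/$loc" > "/tmp/$loc.conf"
#   rsnapshot -c "/tmp/$loc.conf" hourly
```

Since each generated config gets its own lockfile and snapshot_root, one stuck or dead client can never block rotation for the others.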

However, that behavior means that no (hourly) snapshot rotation is performed if any warning is encountered. It is quite often necessary to take snapshots of live filesystems, and with rsync, getting warnings about, for example, files that have gone missing during the sync operation is, as far as I know, in many cases inevitable.

I would guess that a better option for most cases would be to instead run something like:

rsnapshot sync ; test $? -ne 1 && rsnapshot hourly

To differentiate between the exit values:

0 All operations completed successfully
1 A fatal error occurred
2 Some warnings occurred, but the backup still finished
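Assuming those documented exit values, the one-liner can be expanded into a small wrapper function (the function name is made up; pass the real rsnapshot command as its argument):

```shell
#!/bin/sh
# Rotate unless the sync failed fatally. rsnapshot exit values:
#   0 = success, 1 = fatal error, 2 = warnings but the backup finished.
# "$@" is the backup command, e.g. "rsnapshot" or "rsnapshot -c foo.conf".
sync_then_rotate() {
    "$@" sync
    rc=$?
    if [ "$rc" -eq 1 ]; then
        echo "fatal error during sync, skipping rotation" >&2
        return 1
    fi
    "$@" hourly    # rotate on success (0) or mere warnings (2)
}
```

A crontab entry would then call something like "sync_then_rotate rsnapshot" from a wrapper script, so that missing files on a live filesystem (exit 2) no longer block rotation.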

Now, the manual doesn't make it perfectly clear what the differentiation between fatal errors and warnings is. What are the most common cases causing warnings and errors, respectively? Are there any obvious problems with the strategy above?

With the sync_first strategy, how would one best handle (and potentially block) other rotation intervals (e.g. daily, weekly), in case of a long term problem?

This is my first post to this list. I have been running backups using rsnapshot for the last 6 years and encountered my first problem last week. When recovering files from a snapshot, we noticed a few cases of data corruption. The affected files were fairly old (several years), and we have not been able to locate the source of the error. It could possibly have happened during a raid disk failure a few years ago on the volume where the snapshots are stored.

The reason the corrupt files have been able to stay undetected all this time is the default rsync -a strategy of only comparing file modification times. Temporarily changing the rsync flags to -ac, meaning rsync will checksum file contents, turned up several other cases of corrupt files. Unfortunately, rsync -ac made the entire sync process take >5 hours instead of ~10 minutes.

I would suggest running a checksummed sync less often (such as once a month) to verify the integrity of the snapshots. It would be nice if rsnapshot could be extended to take a flag selecting an alternate rsync invocation, without having to resort to keeping two different configuration files up to date.

On another note: on a new setup, I recently tried the rather poorly documented "sync_first" option. The suggested way of using it seems to be to run, from crontab:

rsnapshot sync && rsnapshot hourly

However, this had the unfortunate consequence that the snapshot failed to rotate whenever any one (out of several dozen) of the clients failed to respond. It would be nice to be able to discriminate between two error conditions: an inconsistent snapshot, due to a synchronization failing in the middle of its operation, and a consistent but possibly outdated snapshot, where one or several backup locations could not be contacted. The snapshot rotation should be performed in the latter case, but not in the former.

This is why you split the distinct targets into multiple repositories
with multiple rsnapshot configs, each with their own failures and
successes. You can unify most of the configuration by using shared
"include" files, and merely alter the target directories and lockfiles
as appropriate.

Once you've done that, you can theoretically run "hardlink" separately
against the multiple repositories to de-duplicate files, but that can
cause surprises when you've stored everything efficiently, but you've
run out of inodes....