How delta copying works

Delta copying is an optimized way of copying a file when an older version of the same file already exists at the destination.

---

When no copy of a file exists at the destination, we have no option but to read every byte of the source file and then write them all into the backup copy.

If we then make a small change to the source file and repeat the copying, the vast majority of the data we'll be writing will be exactly the same as what's already on the disk.

So it only makes sense to try and eliminate these redundant writes, and that's exactly what delta copying is about.

Naturally, there is more than one way to approach this matter.

The rsync way

The widely-used rsync tool [1] uses two cooperating processes - one at the source and another at the destination - which both read their own copy of a file, block by block, and talk to each other to compare block checksums. When checksums don't match, the source process forwards the respective block to the destination process, which merges it into the destination file.

That's a much simplified version of rsync. There's quite a bit more to its algorithm, because after all it was Tridge's PhD thesis :)
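For illustration, here's a toy sketch of the block-comparison idea in Python. It only compares blocks at fixed offsets - real rsync pairs a cheap rolling checksum with a stronger digest so it can match blocks at arbitrary offsets - so treat this as a cartoon of the concept, not the actual algorithm:

```python
import hashlib

BLOCK = 32 * 1024  # illustrative block size

def block_hashes(data: bytes) -> list:
    """Hash each fixed-offset block. Real rsync uses a rolling weak
    checksum plus a strong digest, and matches at arbitrary offsets."""
    return [hashlib.md5(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]

def sync(source: bytes, dest: bytes) -> bytes:
    """'Destination process': rebuild dest to match source, reusing
    local blocks whose checksums agree and fetching only the rest."""
    src_hashes = block_hashes(source)   # computed at the source end
    dst_hashes = block_hashes(dest)     # computed at the destination end
    out = bytearray()
    for i, h in enumerate(src_hashes):
        if i < len(dst_hashes) and dst_hashes[i] == h:
            out += dest[i * BLOCK:(i + 1) * BLOCK]    # reuse local block
        else:
            out += source[i * BLOCK:(i + 1) * BLOCK]  # "sent over the wire"
    return bytes(out)
```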

The biggest plus of rsync is that all you need is two copies of a file, and then it can make one look like the other, expeditiously.

The biggest minus of rsync is that you need to have a copy of rsync running on the receiving end. Meaning, if your NAS doesn't support rsync, then that's it. No rsync for you.

Delta copying

Bvckup 2 and its older brother Bvckup take a different approach.

When a file is first copied, the app splits it into equally-sized blocks, computes a hash for each block and then stores these hashes locally.

On the next copy, as the app goes through the source file block by block, it re-computes the hashes and compares them to the saved versions. If they match (* see below), the block is assumed to be unchanged and is skipped over. Otherwise, it is written out and the saved hash is updated to the new value.
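A minimal sketch of this scheme in Python. It uses a single MD5 per block (whereas the app stores two checksums per block, plus a whole-file hash) and ignores shrinking files:

```python
import hashlib

BLOCK = 32 * 1024  # default block size, per the details below

def first_copy(src: bytes):
    """First run: copy everything and remember a hash per block."""
    hashes = [hashlib.md5(src[i:i + BLOCK]).digest()
              for i in range(0, len(src), BLOCK)]
    return bytearray(src), hashes

def delta_copy(src: bytes, dst: bytearray, hashes: list) -> int:
    """Next runs: rewrite only blocks whose hash changed.
    Returns the number of bytes actually written."""
    written = 0
    for i in range(0, len(src), BLOCK):
        block = src[i:i + BLOCK]
        h = hashlib.md5(block).digest()
        idx = i // BLOCK
        if idx < len(hashes) and hashes[idx] == h:
            continue  # unchanged block, skip the write
        dst[i:i + BLOCK] = block
        if idx < len(hashes):
            hashes[idx] = h     # update the saved hash
        else:
            hashes.append(h)    # file grew, new block
        written += len(block)
    return written
```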

Easy-peasy. But what of them caveats, you wonder? Indeed.

The caveats

1. The last thing we want is to skip a modified block only because it happened to have the exact same hash as its previous version. The risk of this event is mitigated by using two separate checksums for each block, both of which are stored in a hash file.

Additionally, Bvckup 2 computes a full-file checksum using a third digest algorithm. When no block-level changes are detected in a file, this hash is verified against its version from the previous run. If there's ever a mismatch, the file is re-copied in full.

2. Delta copying assumes that the destination file remains unmodified between runs. If it doesn't, then all our precious locally saved block hashes will simply be of no use.

Luckily, since we are in a backup software context, this holds true in the vast majority of cases. However, as they say - trust, but verify.

To catch changes to destination files, the app saves their size and created/last-modified timestamps alongside the block hashes. If these aren't an exact match to reality on the next run, the destination file is deemed to have been modified and is re-copied in full.
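A sketch of this verification. The dict layout and the use of just size plus last-modified time are illustrative, not the app's actual format (which also tracks the "created" timestamp, a Windows-specific attribute):

```python
import os

def snapshot(path: str) -> dict:
    """Record the backup copy's metadata alongside the block hashes.
    Illustrative fields only; the real app also stores 'created'."""
    st = os.stat(path)
    return {"size": st.st_size, "mtime": st.st_mtime_ns}

def delta_state_valid(path: str, saved: dict) -> bool:
    """If the live file no longer matches the saved metadata, the
    delta state is stale and the file must be re-copied in full."""
    st = os.stat(path)
    return st.st_size == saved["size"] and st.st_mtime_ns == saved["mtime"]
```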

3. Delta copying is an in-place update algorithm. It works with a live copy of the destination file, meaning that if we cancel/abort the copying mid-way through, we may end up with a partially updated file.

There's not much we can do about this other than to detect this regrettable development on the next run and deal with it appropriately.

For orderly cancellations, the app remembers how far along the file it was, stashes this information with the hashes and then resumes from this point on the next run.

For abortive cancellations, the size/timestamp provision from #2 above will ensure that the file is re-copied in full on the next run.

The details

Delta copying is used only for larger files.

Files smaller than 2MB, and files under 32MB that weren't modified within the last 30 days, are always copied in full. In older releases (including the last beta) the criteria were simpler; see [2] for full details.

---

Default block size is 32KB.
Per-block hashes are MD5 and a variation of CRC32.
Per-file hash is SHA1.

This means that we store 20 bytes of hashes per 32KB of raw data, plus a fixed per-file overhead of 40-something bytes. This works out to about 0.06% of the data size, which is not that bad.
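The arithmetic, spelled out using the figures above:

```python
BLOCK = 32 * 1024   # default block size
PER_BLOCK = 16 + 4  # MD5 (16 bytes) + 4-byte CRC32 variant

# relative overhead per block of raw data
print(f"{PER_BLOCK / BLOCK:.4%}")      # → 0.0610%, i.e. about 0.06%

# e.g. a 1 GB file needs ~640 KB of hash state
print((1024**3 // BLOCK) * PER_BLOCK)  # → 655360
```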

---

Internally, the delta copying routine is organized into a reading-hashing-writing pipeline, operating fully asynchronously on a pool of I/O buffers.

The copying starts with the app issuing multiple read requests in parallel.

Once a request completes, the I/O buffer is forwarded to the hashing module, which maintains a standby pool of hashing threads. Once the buffer is hashed, and if it appears to be modified, a write request is issued for it. Then, once the write request completes, the buffer is reused to read the next block in sequence and the cycle repeats.
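A toy rendition of this cycle, with a Python thread pool standing in for the hashing module. The real pipeline drives asynchronous I/O over a fixed pool of buffers; this simplification reads from memory and only parallelizes the hashing:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 32 * 1024

def delta_copy_pipelined(src: bytes, dst: bytearray, saved: list):
    """Hash blocks on a pool of worker threads, then write out only
    the blocks whose hash differs from the saved one."""
    def hash_block(i):
        return i, hashlib.md5(src[i:i + BLOCK]).digest()

    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() preserves block order, so writes land in sequence
        for i, h in pool.map(hash_block, range(0, len(src), BLOCK)):
            idx = i // BLOCK
            if idx < len(saved) and saved[idx] == h:
                continue                          # unchanged, skip
            dst[i:i + BLOCK] = src[i:i + BLOCK]   # write modified block
            if idx < len(saved):
                saved[idx] = h
            else:
                saved.append(h)
```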

The delta copying module comes with a lot of settings - from the hashing thread count to buffer counts to read/write chunk sizes - all tweakable. However, the app does a good job of picking the defaults based on the exact disposition of the source and destination - whether they are on the same drive, whether they are on the network, whether they are over an older or newer SMB protocol, etc. - so generally there's no need to mess with them.

---

So there you have it - the delta copying - a new best friend of your VM images and TC containers :-)

In short - push-style backups maximize the efficiency of delta copying.

---

When a backup is going over the network, there's often a question of where it's better to run the app.

If the app runs on the source machine, it's a "local-to-remote" or "push" backup. And when the app runs on the backup machine, it's a "remote-to-local" or "pull" backup.

---

Delta copying gets its speed benefits from being selective with writes. With push backups all reads are local (fast) and writes go over the network (slow). With pull backups all reads are over the network (slow) and writes are local (fast).

So if we are reducing the number of writes, then with faster reads and slower writes the effect will be far more pronounced => push backups are better. In other words, running the app on the source machine will generally result in faster backups.
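A back-of-envelope model makes the asymmetry concrete. The throughput figures are made up for illustration - say a local disk at 500 MB/s and a gigabit link at roughly 100 MB/s - and the model assumes every byte is read while only changed bytes are written:

```python
def backup_time(size_gb, changed_frac, read_mbps, write_mbps):
    """Crude delta-copy time model, in seconds: read everything,
    write only the changed fraction. Numbers are illustrative."""
    size_mb = size_gb * 1024
    return size_mb / read_mbps + size_mb * changed_frac / write_mbps

# 100 GB file with 2% of its blocks changed
push = backup_time(100, 0.02, read_mbps=500, write_mbps=100)  # local reads, remote writes
pull = backup_time(100, 0.02, read_mbps=100, write_mbps=500)  # remote reads, local writes
```

With these (made-up) numbers the push run comes out to roughly 225 seconds versus over 1000 for pull - the slow network leg only carries the changed 2% instead of the whole file.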

---

Additionally, push backups can also make full use of destination snapshot caching - an option that tells Bvckup 2 to preserve and reuse destination tree index between backup runs.

This option is On by default, and it eliminates the need to re-scan the destination location on every run. When the destination happens to be over the network, this may translate into a considerable speed-up, especially if the backup is big but its per-run changes are few and far between.

"Files smaller than 2MB and files under 32MB that weren't modified within last 30 days are always copied in full."

Let me see if I have this straight: in my source I have a folder tree of 11.2GB in 365,000 files, most under 2MB. Every one of those files will be copied on every Bvckup 2 run, even when date/time/size are unchanged?

Hi, the above was a good explanation of such a useful feature of the program, but I'm wondering something about it:

Is there a way to force the "delta copying" mode for specific folders/files which one foresees should be copied that way, and not in the regular, full mode?

I mean, let's say I have two folders, source and backup; there is an "example_123.x" file in the first one, which I regularly update, but its filename will not always stay the same - it varies slightly with each update, say to "example_124.x". Currently, if I want any chance of Bvckup using the "delta copying" feature for said updated file (because I'm positive it is just an updated version of the same file, and not a totally new one), I have to rename it beforehand so that it matches the one in the source folder, because otherwise it is going to be handled as a completely new file and transferred in full. Besides, if it is copied in full, I will have to delete the previous "example_123.x" file, which means an additional "deleting" task.

If this happens because 123 gets _renamed_ into 124, then bvckup2 can understand that - it will both rename 123's backup copy and preserve its delta copying state, so it will keep on delta-updating the file. This however requires "Detecting Changes" setting in backup settings to be set to "Use snapshot" (and not to "Re-scan on every run") and "Rename detection" enabled for files.

However if 124 is created from scratch and filled with 123's data, then the app won't be able to link these two files and it will indeed re-copy 124 from scratch.

What you are suggesting is effectively a kind of "hint" system to tell bvckup2 that THIS file and THAT file are two versions of the same source file even though it's not obvious. This is not a bad idea, but I strongly suspect that it will get very hairy when it comes to the implementation.

Yes, I think the only way of avoiding the full copying process for source files and their updated equivalents would be renaming the source files with a generic name, and then using that generic name for all the newer files as well, so they match each time.

That's certainly not the most straightforward way of handling the updates for said cases, but it will still be more efficient than copying full files each time, especially with larger ones.

What you mention about the "hint" system is kind of what I first thought of - Bvckup 2 being smart enough to detect that some files should be delta copied even if there are small differences in their filenames, whether through a kind of special index of said files, or some option forced specifically for them.

But of course, I know there are miles between just thinking about a kind of workaround like that and having it implemented. ;)

I apologize if this question is obvious or has already been answered.
What happens if I modify a large file, Bvckup starts backing it up using delta copying, and then my source hard drive dies?
Am I left without a usable copy of the file (assuming I only have the one backup)?

Is there a way to have Bvckup pre-emptively compute the hash on destination files if they already exist prior to the first run?

I have 4TB of VHDX files from a Hyper-V virtual machine already on the destination, never touched by Bvckup. I want to set up a Bvckup job from the near-identical source VM to these destination files, using delta-copy. Because the delta-copy hash doesn't exist, it begins to copy the files in full.

I'd like to pre-compute that hash to avoid saturating the network for weeks. Perhaps a workaround would be to use Bvckup to make a copy of the destination files locally, and then transport the hash to the source side?

"Perhaps a workaround would be to use Bvckup to make a copy of the destination files locally, and then transport the hash to the source side?"

Exactly right, but there are some prerequisites.

You can indeed pre-create delta state by making an ephemeral backup to a temporary local location. However, the delta state for each file stores the exact "created" and "last-modified" timestamps of the backup copy. So for this whole thing to work, you will need the timestamps on these temporary local backup copies to match the timestamps on their counterparts at your existing destination _exactly_.

For that:

1. Source and destination should be using the exact same file system - this has to do with the fact that NTFS, FAT, etc. all trim timestamps differently. NTFS has a resolution of 100ns, FAT - 2 seconds, and pseudo-NTFS on NAS boxes - anything in between.

2. "Last modified" timestamps on source and destination copies of files must be the same. This has to do with the fact that the created/last-modified timestamps of backup copies are saved in the delta state and verified on the next run. If they don't match their live versions, the delta state is assumed to be out of sync with the backup copy and is discarded.

So, to that end -

Create a temporary backup job, set it to manual, point it at your source and destination, and run a simulated run (via the right-click menu). Now, look at the log and check that all Bvckup 2 plans to do is update the "created" timestamps for your files. If it plans to copy the data, then you are out of luck.

If it's just timestamps, then run the job - it will sync the "created" timestamps and finish quickly.

Now, remove this job. Create another one, set to manual and point it at your source and temporary local folder. Run it and it will copy the files and compute all required delta state. Go back into the job settings and re-point it at your destination.

And that's it - you should have delta state that is synced with your destination files.

... But ...

I must say that this is hacky. If you happen to have a file that only _appears_ to be unchanged between the source and destination, then it will be fundamentally screwed up by running a job that is set up as per above.

"Now, remove this job. Create another one, set to manual and point it at your source and temporary local folder. Run it and it will copy the files and compute all required delta state. Go back into the job settings and re-point it at your destination."

What if I already have a different backup job which I want to add these items to (ie, currently these items are unselected in the existing job)? If I create my temp job, copy the files so that all the delta states get computed, and then add those files to the existing backup *before* deleting the temp job, will the delta states persist, or will they get deleted when I delete the temp job I was using for computing delta states?

I guess a related question is if delta states are shared across jobs if multiple jobs are working with the same files (and so delta states only get deleted when there are no more relevant jobs existing)?

"will they get deleted when I delete the temp job I was using for computing delta states?"

Delta state is a property of a job. When a job is deleted, the delta state is deleted too.

That said, delta state for a file "xyz.abc" is stored in \engine\backup-00xx\deltas\ directory in a binary file named after md5("path\to\xyz.abc"), e.g.

\engine\backup-0003\deltas\4d\e8\2a2e3965ed6ecfdc6ebef6473230.dat

whereby 4de82a2e3965ed6ecfdc6ebef6473230 is the md5 hash. So, in theory, you can move the delta state for a file between two jobs if you know the exact file path/name (with the path being relative to the job's From folder).
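The naming scheme can be reproduced with a few lines of Python. Note that the text encoding of the path before hashing is an assumption here - the post doesn't specify it:

```python
import hashlib

def delta_state_path(job_dir: str, rel_path: str) -> str:
    """Build the deltas\<aa>\<bb>\<rest>.dat file name described above;
    rel_path is relative to the job's From folder. UTF-8 encoding of
    the path is a guess, not something the post confirms."""
    h = hashlib.md5(rel_path.encode("utf-8")).hexdigest()
    return f"{job_dir}\\deltas\\{h[:2]}\\{h[2:4]}\\{h[4:]}.dat"
```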

I think there used to be a way to force backup of a very large file, regardless of any date/timestamp/size checks, and also to force delta copying to be used for that file. Has that capability been removed? It was useful for container files where the external container file information doesn't ever change (by construction).