Parallel execution

Do more, faster

Overview

This article provides an overview of multi-threaded
backup processing, a mode in which multiple backup steps are
executed in parallel.

Our goal is to see backups complete as quickly as possible,
so the discussion is framed relative to this.

Executive summary

Executing several backup operations simultaneously results in a significant
increase in overall processing speed, with one notable exception.

Speed improvements are especially pronounced in backups that
include a lot of small files or that go over the network.

Local backups also benefit from parallel execution, especially
when using non-mechanical drives, e.g. SSD or NVMe drives.

The case of copying large files requires special handling which is described below.

Synchronous API

An average backup run involves creating, updating, moving and deleting files and folders.

These operations are carried out by sending requests to the file system and then awaiting confirmation.

The trouble is that most of these requests are synchronous.

This means that the operating system doesn't return control to the
program until the request is complete. The program is basically
stuck waiting for Windows to be done with the request.

Of particular interest here is the time that the program spends
idling while a request is in transit. These slivers of wait time
may not look like much, but once we have 100,000 folders to
create, they start to add up.

Worse yet is the case of working with a remote network share
over a high-latency connection. The overhead of passing requests
back and forth can be so substantial that it may end up
representing most of the processing time.

The async way

The obvious solution here is to not wait for a request
to complete before firing off another one.

This effectively creates a request queue and ensures
that the OS is kept busy processing our work for as long as
we have work to give it.
This is asynchronous processing. We issue a request
and have it complete at some point later on, freeing us to
do other things instead of waiting.

So now that we know that we want asynchronous processing,
the question is how to get a blocking CreateDirectory request
to behave this way.

Luckily, it is very easy to do and the answer is threads.

However threads are just a means to an end.
What we are really after is the asynchronicity
and in one important case it is readily supported by the OS.

Working with files

All operating systems, Windows included, use the same pattern for working with files.

Before a program can do anything with a file's contents or
its metadata, the file needs to be "opened".

Opening a file involves passing the file name to the OS along
with access requirements and other details. The OS verifies
that the file exists, that the requestor has the required
privileges, that the file is not locked for exclusive use by
some other program, etc.
If all this checks out, the OS responds to the program with
a "handle", which is basically a file access token.

Then, all further operations with the file - reading, writing,
querying or setting its attributes, timestamps, etc. - are
done using this handle.

Finally, once the program is done with a file, it needs to
return the handle to the OS, which is referred to as "closing
the file."

Opening a file is a relatively expensive operation,
meaning that it may take as much time as, for example,
reading the contents of a small file. It is also
synchronous.

By comparison, the actual file reading and writing can be done
asynchronously. Combined with certain Windows kernel features
this asynchronous data transfer can be made to fully saturate
all the available bandwidth of the file system stack
with just one thread.

Asynchronization

Threads

Every program starts with a single thread. As this thread
goes through the code, it may create additional threads
and set them to execute other bits of code.

Threads are one of the core programming abstractions, flexible
and very powerful.

Parallel execution

The key property of threads is that they execute in
parallel.

Threads are also typically made to cooperate with each
other, i.e. one thread "delegates" some work to
another thread and then collects the results once they
are ready.

So if we want to execute a blocking request asynchronously,
all we need to do is spawn a thread, tell it to issue
that request on our behalf and then get back to us when
it's done.

There's not much to it, but there are... details.

Details

1. Dependencies

In the context of making backups, we typically have a long list of simple steps that, when executed, bring the backup in sync with the source.

The order of steps is important, because some steps depend on
others. For example, we can't start copying a file until its
parent folder is created, nor can we delete a folder until all of
its contents are removed.

Fortunately, if a backup program implements a backup planner,
all this is taken care of transparently. Two file trees go in -
one for the source, another for the backup - and a list of backup
steps comes out, in the right order and with all dependencies specified.

Still, the first detail to not be overlooked is that parallel
execution needs to observe and respect inter-step dependencies.

2. Thread count

Second is a question of how many threads to use.

More is not necessarily better.

For local operations there should be as many threads as
there are CPU cores. Spawning more threads will only cause them
to compete for CPU time, resulting in slower overall
execution time. Spawning fewer threads will usually be
sub-optimal.

For remote operations it depends, but the local
CPU count is also a good starting point. For links and networks
with high latency increasing the thread count may improve
throughput, but for a regular LAN the effect is likely to be
small, if any.

Conversely, there's little point in reducing the thread
count, because a few threads too many will merely result in
a slightly longer request queue, which is (usually) not
harmful to performance.

That is, the rule of thumb for remote backups is to start with
the local CPU count and test larger counts when fine-tuning.

3. File copying

The third gotcha has to do with file copying.

When copying smaller files, the time needed for opening and closing
them is comparable to that spent on reading and writing them.
Copying a stream of small files one at a time would therefore
involve a lot of stalling.

=> We do want to be copying more than one small file at a time.

But remember how we can saturate the IO pipeline by copying a single large file?
If we throw another large copy into the mix, it will cause transfers to
compete for bandwidth and slow each other down.

=> We do not want to copy more than one large file at a time.

The solution is simple - we limit file copying to a single
large file at a time, whereby a file is considered large
if it takes more than X requests to be read in full.

Bvckup 2 uses a threshold of 32 requests, which may seem a bit high,
but has proved to work well in practice.

4. Error handling

Retrying on transient errors in the presence of parallel execution
is another pitfall.

When there are X requests pending and one of them fails with,
say, network unreachable, what's the right thing to do?

On one hand, the program may just retry each failed request
after a pause. However, if the network remains down, this will
pollute the backup log with redundant failures and generally
create a lot of fuss where none is needed.

So instead the program needs to exit the retry pause
gingerly, by retrying a single step first and following up
with the rest only if that goes well.

But then there are also edge cases.

Sometimes a network hiccup will cause only some requests
to fail right away. More will fail in a minute, a few more in
10 minutes, and some will actually manage to complete, but
take an hour to do so.

As a result the retry logic ends up containing a lot more
complexity than may seem initially necessary. Caveat
emptor.

5. Memory utilization

Using extra threads invariably means higher memory usage.

Usually the per-thread overhead is not very big, unless a
thread needs to copy a file. Copying involves allocating
multiple IO buffers, and with modern drives these often need
to be large to maximize performance.

It's not atypical for the copying module to use 32 MB in
buffers per file transfer. Multiplied, say, by 16 cores,
that's half a gigabyte.

However, since we do not copy large files in parallel,
we kill two birds with one stone and get a free pass with this
issue.

Robocopy /mt

As you may know, Windows ships with a very handy file copying
utility called robocopy.

One of its many features is support for multi-threaded
file copying, enabled with the /mt command-line switch.

There are two differences between robocopy /mt and
Bvckup 2.

First, /mt applies to the file copying only. All other
operations, e.g. folder creation and item deletion, are still
done sequentially.

Second, it does not differentiate between small and
large files and runs all file copies in parallel, unconditionally.

* There are other important differences between the two,
but these aren't related to multi-threaded processing.

Sample numbers

The following numbers are meant to give an idea of how parallel
and sequential modes compare in terms of performance. These are
specific to replicating from a local NVMe drive to a local SSD
and to a remote share on a low-end NAS over Gigabit Ethernet.

Exact numbers and speed-up rates will vary greatly
with the setup and the state of the environment. When running
your own tests, make sure to compare apples to apples. In
particular, the state of the file system cache - warm vs. cold
- may sway test results by orders of magnitude.

That said, it is not uncommon to see speed-ups in
full multiples of single-threaded numbers when
using very fast storage and networks.