We have a utility that uploads files (and performs other operations on them) to a network shared location.
The file sizes vary from a few MB to 500 MB.
A suggestion has come up that we should support multithreading when uploading the files to the shared location - not by splitting each file into byte chunks, but by having each thread pick one file and upload it.

I am not sure that multithreading can speed up I/O operations like this. Is my hunch valid?

If we do end up building this functionality, what would be a good design approach for the file copy engine?
Would it make sense to use a tool like robocopy (I have read that newer versions support multithreading)?

Edit: Apologies for the delay and for missing some vital information.
This utility is built in C# (.NET 2.0), and any future update also has to use .NET (the framework version is not a constraint). The utility is installed on the users' machines (around 20, all on Windows XP). The target share is on a Windows Server 2003 machine.

Edit 2: We have decided to run some tests with a simple application that implements the file upload through the TPL. After this analysis we will decide whether to go ahead or not. Thanks everyone for the help extended.
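
For reference, a first cut of that TPL test might look something like this minimal sketch (the paths and the degree of parallelism are placeholders to experiment with; requires .NET 4):

    using System;
    using System.Diagnostics;
    using System.IO;
    using System.Threading.Tasks;

    class UploadTest
    {
        static void Main()
        {
            // Hypothetical paths - substitute your own source folder and share.
            string sourceDir = @"C:\Outbox";
            string targetDir = @"\\server\share\Inbox";

            string[] files = Directory.GetFiles(sourceDir);
            var timer = Stopwatch.StartNew();

            // One file per thread, as suggested; MaxDegreeOfParallelism is the knob to vary.
            Parallel.ForEach(
                files,
                new ParallelOptions { MaxDegreeOfParallelism = 4 },
                file => File.Copy(file, Path.Combine(targetDir, Path.GetFileName(file)), true));

            timer.Stop();
            Console.WriteLine("Copied {0} files in {1}", files.Length, timer.Elapsed);
        }
    }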

What programming language? In C, a more idiomatic approach might be to use asynchronous I/O, using a select loop instead of threads. Although doing so requires you to "turn your code inside-out" (the code to copy a file isn't a straightforward sequence of commands anymore), you won't have to worry about thread synchronization.
– Joey Adams, Dec 2 '11 at 23:43
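
The .NET analogue of that approach is asynchronous I/O. A minimal sketch, assuming .NET 4.5 for Stream.CopyToAsync, with hypothetical paths:

    using System.IO;
    using System.Threading.Tasks;

    static class AsyncCopy
    {
        // Overlapped I/O: no thread sits blocked on the disk or the network.
        public static async Task CopyAsync(string source, string dest)
        {
            using (var input = new FileStream(source, FileMode.Open, FileAccess.Read,
                                              FileShare.Read, 81920, useAsync: true))
            using (var output = new FileStream(dest, FileMode.Create, FileAccess.Write,
                                               FileShare.None, 81920, useAsync: true))
            {
                await input.CopyToAsync(output);
            }
        }
    }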

Probably the easiest reasonable solution is to let the OS handle it all: SHFileOperation(FO_COPY). That gets you all optimizations that the people at Microsoft considered reasonable.
– MSalters, Dec 30 '11 at 13:15
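
SHFileOperation(FO_COPY) is reachable from C# via P/Invoke. A hedged sketch, assuming the 32-bit XP clients mentioned in the question (paths are placeholders; see MSDN for the full SHFILEOPSTRUCT contract):

    using System;
    using System.Runtime.InteropServices;

    class ShellCopy
    {
        // SHFILEOPSTRUCT is byte-packed on 32-bit Windows (default packing on 64-bit).
        [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode, Pack = 1)]
        struct SHFILEOPSTRUCT
        {
            public IntPtr hwnd;
            public uint wFunc;
            public string pFrom;
            public string pTo;
            public ushort fFlags;
            public bool fAnyOperationsAborted;
            public IntPtr hNameMappings;
            public string lpszProgressTitle;
        }

        const uint FO_COPY = 0x0002;
        const ushort FOF_NOCONFIRMATION = 0x0010;   // "yes to all" on prompts

        [DllImport("shell32.dll", CharSet = CharSet.Unicode)]
        static extern int SHFileOperation(ref SHFILEOPSTRUCT lpFileOp);

        static void Main()
        {
            var op = new SHFILEOPSTRUCT
            {
                wFunc = FO_COPY,
                // The marshaller appends one '\0'; the extra one here gives the
                // double-null terminator the API requires.
                pFrom = @"C:\Outbox\*.*" + "\0",
                pTo = @"\\server\share\Inbox" + "\0",
                fFlags = FOF_NOCONFIRMATION
            };
            int result = SHFileOperation(ref op);
            Console.WriteLine(result == 0 ? "Copy succeeded" : "Copy failed, code " + result);
        }
    }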

Cough robocopy cough... you could automate it with something like robomojo
– James Snell, Jul 9 '13 at 10:40

9 Answers

This depends on what the limiting factor is, doesn't it? If the bottleneck is the utility program, then sure, running more than one copy or using more threads will speed things up. If the network is the limiting factor, then adding multiple instances of the utility isn't going to help, since you will still be stuck moving at most X bytes per second. In fact it might hurt, because you have the additional overhead of a second copy of the app. The same goes for disk I/O: you can only copy as fast as either machine can read from and write to disk. If that's already maxed out, adding copies isn't going to help.

What you need to do is test to see what the bottleneck is, and go from there.
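
The cheapest way to find out is to time a plain sequential copy and compute the throughput; if one stream already saturates the network or the disk, parallelism has nothing left to win. A minimal sketch with placeholder paths:

    using System;
    using System.Diagnostics;
    using System.IO;

    class ThroughputProbe
    {
        static void Main()
        {
            // Hypothetical test file and destination share.
            string source = @"C:\Outbox\sample.bin";
            string target = @"\\server\share\Inbox\sample.bin";

            var timer = Stopwatch.StartNew();
            File.Copy(source, target, true);
            timer.Stop();

            double mb = new FileInfo(source).Length / (1024.0 * 1024.0);
            Console.WriteLine("{0:F1} MB in {1:F1} s = {2:F1} MB/s",
                mb, timer.Elapsed.TotalSeconds, mb / timer.Elapsed.TotalSeconds);
        }
    }

Compare the result against what the raw network link and disks are rated for, and you will know where the ceiling is.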

If you are talking about one big file, multithreading won't really help. You're going to be I/O bound, so using a single thread won't slow THAT upload down.

What you may need to worry about, though, is resource contention (assuming you are writing the server, too). If you're handling the upload in the same thread that also accepts and processes new requests, other requests will have to wait. As long as you defer back to the selector queue after reading a chunk from the socket and writing it to disk, though, you should be fine.

When copying a lot of smaller files, multithreading can help, because there tend to be gaps in the data transfer while the program is searching directories for the next file, opening it, and reading its data.

Multithreading will also help when the client and the server both have parallel data storage such as RAID or SSD: anything that performs better at higher queue depths.

Other than that, it will often slow things down. For example, making a single hard drive read or write two files at the same time will force it to repeatedly seek from file 1 to file 2.

Multithreading will not help in the way you are thinking about it, since there is most likely only one communication path between the client and the server, the client is most likely reading files off a single hard drive, and the files are most likely being written to a single hard drive on the server. (RAID will make some difference here, but not much.) On the contrary, as has already been pointed out, performance will probably degrade, because there will be constant seeking between the files being read in parallel on the client, and constant seeking between the files being written in parallel on the server. Also, the files may end up badly fragmented on the server.

However, multithreading can help in a different way: with just two threads on the client and another two on the server, the file I/O can be decoupled from the network I/O. This means the client can be transmitting one chunk of a file while reading the next chunk from its disk, and the server can be writing one chunk to disk while receiving the next chunk from the network. This would greatly speed up the transfer process. I would guess that every specialized file copy utility out there is smart enough to do that, but I may be wrong; so if Robocopy advertises multi-threaded copy, that's fine, go with that.
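
To make that concrete, here is a hedged sketch of the client side: a reader thread pulls chunks from disk into a small bounded queue while the main thread drains the queue to the destination stream. BlockingCollection requires .NET 4; the chunk size and queue depth are arbitrary starting points:

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading.Tasks;

    static class PipelinedCopy
    {
        // Overlaps reading the next chunk from the local disk with writing the
        // previous chunk to the destination (e.g. the network share).
        public static void Copy(string sourcePath, string destPath)
        {
            const int ChunkSize = 1 << 20;                       // 1 MB chunks
            var chunks = new BlockingCollection<byte[]>(boundedCapacity: 4);

            Task reader = Task.Factory.StartNew(() =>
            {
                try
                {
                    using (var input = File.OpenRead(sourcePath))
                    {
                        var buffer = new byte[ChunkSize];
                        int read;
                        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
                        {
                            var chunk = new byte[read];
                            Buffer.BlockCopy(buffer, 0, chunk, 0, read);
                            chunks.Add(chunk);                   // blocks when the queue is full
                        }
                    }
                }
                finally
                {
                    chunks.CompleteAdding();                     // always unblock the writer
                }
            });

            using (var output = File.Create(destPath))
            {
                foreach (var chunk in chunks.GetConsumingEnumerable())
                    output.Write(chunk, 0, chunk.Length);
            }

            reader.Wait();                                       // surface any read errors
        }
    }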

Doing what you suggest in a naive fashion will kill your throughput; the choke point is disk I/O, not getting files ready.

I'd suggest using one thread that receives files to work on, queues them up, and keeps a sequential copy going on anything in the queue; your supplier thread is responsible for getting the files ready to queue up. This way you're not thrashing the file system on the shared drive(s), and you're not doing files one at a time with gaps while the next one is prepared - you're preparing and sending simultaneously.

The bonus is that there's only one synchronization point to worry about: the queue.
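
A minimal sketch of that design, assuming .NET 4's BlockingCollection; the single consumer keeps the copies strictly sequential while suppliers stage work:

    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading.Tasks;

    class CopyEngine
    {
        readonly BlockingCollection<string> _queue = new BlockingCollection<string>();

        // Supplier side: call whenever a file is ready to go.
        public void Enqueue(string path) { _queue.Add(path); }

        public void Shutdown() { _queue.CompleteAdding(); }

        // Single consumer: one file at a time, so the local disk and the
        // share each see purely sequential access with no gaps between files.
        public Task Start(string targetDir)
        {
            return Task.Factory.StartNew(() =>
            {
                foreach (string file in _queue.GetConsumingEnumerable())
                    File.Copy(file, Path.Combine(targetDir, Path.GetFileName(file)), true);
            });
        }
    }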

I work for Data Expedition, Inc. which, as Emmad mentioned, produces commercial software for this type of scenario. Multithreaded file transfer can have benefits, but you have to carefully understand what your performance bottlenecks are.

Any network path will have at least dozens of hardware and software components that the data has to pass through. The slowest of them will determine your speed. But how you move the data will change how those components behave.

Running parallel TCP connections can help when individual TCP speeds are falling far below the capacities of the network, the disk, and the CPU.

But if you're looking at network speeds of more than tens of megabits per second, then parallel data transfers will drastically reduce your disk I/O by thrashing the hard drive. It can quickly fall to the point where disk access becomes much slower than the network capacity. Choosing the right read/write block size can help, but that will depend on the particular hardware. Also keep in mind that Windows XP/2003 has very limited paged-pool memory, which can make it unstable at speeds over about 200 megabits per second.

On the flip side, if the network is slower than a few tens of megabits per second, then running many parallel TCP connections can push the latency up to the point where individual sessions begin to slow down or even drop their connections. Again, it's a matter of experimentation to find what level of parallelism will work for any given path and conditions.

So, multithreaded file copy can help if you have a known data path and can take the time to fine-tune the number of parallel sessions and your disk I/O. But it does require that you retune whenever conditions change, and it can be disruptive if you overdo it. That's why we have chosen to avoid parallel transfers in our own software, just as we avoid TCP.