
In a previous article in which I showed some uses of tar, I gave an example of how to use it to move large amounts of data between two computers. Many people have said that rsync is better, or at least that they prefer it, while others prefer netcat. Rather than simply remain convinced that tar+ssh is faster than rsync+ssh, I decided to run a field test and look at some numbers.

For the test I’ll use two computers on my home network, both connected to my router over wi-fi. I expect the bottleneck (for the ssh tests) to be the CPU of laptop1 (Centrino 1.5 GHz).

So we have an average speed of 2.2 Mbit/s. Not much, but since we are only comparing methods against each other, the absolute speed is not an issue in this test.

First test

I’ll copy the directory /var/lib/dpkg, containing 7105 files totalling 61 MB, from laptop1 to laptop2. Because rsync compares changes between source and destination, I made sure to delete the directory on the destination after every run.
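For reference, the four commands I compared look roughly like this. Hostnames and destination paths are illustrative, and the netcat listener syntax varies between GNU and BSD variants:

```shell
# scp: simple recursive copy
scp -r /var/lib/dpkg laptop2:/tmp/dpkg-copy

# rsync over ssh (a single colon implies ssh as the transport)
rsync -avz /var/lib/dpkg/ laptop2:/tmp/dpkg-copy/

# tar over ssh: stream a tar archive through the ssh connection
tar -C /var/lib -cf - dpkg | ssh laptop2 'tar -C /tmp -xf -'

# netcat: no encryption; start the listener on laptop2 first
#   laptop2$ nc -l -p 9000 | tar -C /tmp -xf -     (GNU netcat syntax)
tar -C /var/lib -cf - dpkg | nc laptop2 9000
```

The tar|ssh|tar form avoids creating any intermediate archive file: the archive exists only as a stream on the wire.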

So with my computer (and my network), the best way to transfer files appears to be rsync over ssh.

Second test

Not satisfied with the results I ran exactly the same tests on two servers connected via gigabit ethernet.
Due to an error I was not able to measure the netcat time: it worked properly but never exited, so I could not record the timing (which was still very low).

Rsync+ssh and tar+ssh both completed the job correctly in less than a second, so I’ve reported their times to the hundredth of a second.

Conclusion

Honestly, I expected a ranking like this: netcat > tar+ssh > rsync+ssh > scp, but the results on my PC promptly refuted it.

Similar tests on another blog produced results much closer to what I expected, so my conclusions are:

1) Never use scp: it is the worst of the methods tested.
2) Depending on the network and the computers you have available, you will get different results with rsync and tar.
3) On fast networks tar+ssh seems to perform slightly better, and I expect that netcat, which has no encryption (compared to ssh), could give even better results.
4) Rsync can work incrementally, so it has very strong usability advantages over the other methods if you know from the start that the operation will be performed several times (for example, multiple syncs between two sites in the same day).

3 Responses to “The best way to move data”

I did some similar tests with large datasets (300 GB+) and found tar over ssh to be the most reliable for the speed. One thing I did notice: adding options to the ssh command can really have an impact on performance.

If you are doing file transfers on a local LAN and are not too concerned about full encryption, you may want to add: -c arcfour

Host key checking causes considerable pauses in establishing connections. Once again, if you are on a LAN, you may want to add: -o StrictHostKeyChecking=no

And one other note: if you do public-key authentication, adding -o PreferredAuthentications=publickey can also cut down on the time it takes to establish a connection.
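The three suggestions above can also live in the client configuration instead of the command line, as an ssh_config fragment. This is a sketch for a hypothetical trusted-LAN host; note that arcfour is obsolete and was removed in OpenSSH 7.6, so the cipher line is historical:

```
# ~/.ssh/config entry for a trusted LAN host (illustrative)
Host laptop2
    Ciphers arcfour                  # obsolete; removed in OpenSSH 7.6+
    StrictHostKeyChecking no         # skip host-key prompt on a trusted LAN
    PreferredAuthentications publickey
```

With this in place, a plain `ssh laptop2` picks up all three options automatically.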

Thanks for doing this test. Debian and peers steered me to tar, sftp and finally rsync. The man page for rsync in Debian is excellent and contains many common-case examples. Thanks to your test, I now know that my preferred method, rsync, is also one of the fastest.

I like rsync so much that I’ve started using it as an improved copy for local files. Grsync is a nice GUI front end that works well with frequent but irregularly timed syncs because it stores “sessions”.

I’m not sure of the syntax you are using for “rsync + ssh.” The manual advises:

rsync -avz source:/local_files/ target:/remote_files

I was under the impression that a single colon invoked ssh by default. “man rsh” brings up the ssh man page. Compression (the -z) might bog down slow CPUs, but it might still be faster than your 802.11b.

Another advantage of rsync is its easy resume capability. I’m not sure how the other methods would handle a break in communication.

SSH really isn’t all that CPU-consuming. Sure, it’s encryption, but symmetric encryption (which is used after the initial authentication) can be quite fast even in software. For instance, try “dd if=/dev/zero bs=1048576 | openssl enc -blowfish -k somepass > /dev/null”. On my machine it runs at 91.2 MB/s, almost enough to fully saturate a gigabit line with payload, and faster than many disks at sequential reads.

Regarding rsync vs. tar: rsync is optimized for efficient network usage, not efficient disk usage. In your particular case, it sounds like many small files were the challenge. While rsync is usually the winner in partial updates, it has a strong disadvantage on the first run, since it first scans the directory on both machines to determine which files to transfer.

Also, if the network is the bottleneck and not the CPU, you can optimize your tar example with compression, for instance bzip2 or xz/lzma. My /var/lib/dpkg went from 90 MB to 13 MB using bzip2, with a throughput of 6.1 Mbit, so adding a -j to your tar line might improve your low-bandwidth example even more than rsync, if you want to prove a point.
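The -j suggestion slots straight into the tar+ssh pipeline, compressing the stream before it hits the slow link and decompressing on arrival. A sketch, with the same illustrative hostname as before:

```shell
# bzip2-compressed tar stream over ssh
tar -C /var/lib -cjf - dpkg | ssh laptop2 'tar -C /tmp -xjf -'

# Same idea with an external compressor (xz here), trading more CPU for a smaller stream
tar -C /var/lib -cf - dpkg | xz | ssh laptop2 'xz -d | tar -C /tmp -xf -'
```

Whether this wins depends entirely on which resource is scarce: on the 2.2 Mbit wi-fi link compression should pay off, while on gigabit ethernet it may just burn CPU.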

Also, the rsync invocation above uses compression (-z), so disabling it /might/ improve performance in the high-bandwidth case.