ContentSync – a content-based file copy tool

Suppose you need to sync the contents of two large folders, Source and Destination. Normally robocopy Source Destination /MIR does the job. However, even if only the timestamp of a file changed and its byte content didn’t, robocopy will copy the file and update the timestamp of the destination to match the source.

This is the safe and expected behavior. However, there are whole classes of tools that detect file modification based on the timestamp alone. For instance, MSBuild will trigger a cascading rebuild of everything that depends on a file if its timestamp has changed (even if the actual file contents are exactly the same). This is called overbuilding. A scalable build system should detect that the file contents didn’t change and avoid doing any work in this case.

Or take another example: suppose you need to upload hundreds of thousands of files to Azure using MSDeploy. If the timestamp on those files has changed, MSDeploy will upload those files even if the actual content is the same.

In general, if you’re deciding whether a file was modified based on the timestamp alone, you’re bound to schedule unnecessary work that could have been avoided by checking whether the actual file bytes have changed.
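To make the idea concrete, here is a minimal Python sketch of a content check that ignores timestamps entirely. It is an illustration only, not the actual ContentSync code; the function name same_content and the chunk size are my own assumptions.

```python
import os


def same_content(path_a, path_b, chunk_size=1 << 16):
    """Return True if both paths are files holding identical bytes.

    Hypothetical helper for illustration; timestamps are never consulted.
    """
    if not (os.path.isfile(path_a) and os.path.isfile(path_b)):
        return False
    # Files of different sizes cannot be equal, so skip reading them.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files hit end-of-file together
                return True
```

The size comparison is a cheap early exit: only files that already match in length are read byte by byte.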

Long story short, I wrote an open-source tool to sync/mirror directories based on the file content, not the timestamp:

I needed this for the exact Azure website deployment scenario: I maintain http://source.roslyn.io, and we’ve recently moved from on-premises Microsoft hosting to Azure. I use https://github.com/KirillOsenkov/SourceBrowser to regenerate the website every night, and that’s 160,000 modified files totaling over 1 GB. You don’t want to upload that to Azure every night.

So I inserted an intermediate step into my deployment script – I ContentSync from the freshly built Index folder (it’s not incremental, all the files are generated from scratch every time) into a Staging folder from the last time. ContentSync doesn’t touch the files that haven’t actually changed (and that’s 99.9…%), and so MSDeploy doesn’t upload them since the timestamp is the same.
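The intermediate step can be sketched roughly as follows. This is a simplified Python illustration under my own assumptions, not how ContentSync is actually implemented: the function name content_sync and its return value are hypothetical, and empty leftover directories in the destination are not cleaned up.

```python
import filecmp
import os
import shutil


def content_sync(source, destination):
    """Mirror source into destination, skipping byte-identical files.

    Hypothetical sketch. Returns (copied, deleted) lists of relative paths.
    """
    copied, deleted = [], []
    source_files = set()

    for root, _dirs, files in os.walk(source):
        rel_root = os.path.relpath(root, source)
        os.makedirs(os.path.normpath(os.path.join(destination, rel_root)),
                    exist_ok=True)
        for name in files:
            rel = os.path.normpath(os.path.join(rel_root, name))
            source_files.add(rel)
            src = os.path.join(source, rel)
            dst = os.path.join(destination, rel)
            # shallow=False forces a byte comparison instead of trusting
            # size and modification time.
            if os.path.isfile(dst) and filecmp.cmp(src, dst, shallow=False):
                continue  # unchanged: leave the file and its timestamp alone
            shutil.copyfile(src, dst)
            copied.append(rel)

    # Mirror semantics: remove destination files that no longer exist
    # in the source.
    for root, _dirs, files in os.walk(destination):
        rel_root = os.path.relpath(root, destination)
        for name in files:
            rel = os.path.normpath(os.path.join(rel_root, name))
            if rel not in source_files:
                os.remove(os.path.join(destination, rel))
                deleted.append(rel)

    return copied, deleted
```

Called as content_sync("Index", "Staging"), this would rewrite only the files whose bytes differ, so a timestamp-based uploader that runs afterwards sees everything else as unmodified.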

Since I'm not an expert on MSDeploy, I'd rather not write about something I don't know about and have never used. After a quick search it seems that that functionality has bugs and maybe other issues I'm not aware of. I'd rather leave this as an exercise to the reader 🙂

Sergio – thanks! Octodiff seems to be complementary to this tool – it knows how to efficiently copy a single file (whereas I just use File.Copy). Also, my tool is extra bad in that respect: it reads the remote file twice (first to compare, then to actually copy).
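For what it's worth, one way to avoid the double read is to buffer the source in memory and reuse those bytes for both the comparison and the write. A minimal Python sketch, assuming files are small enough to fit in memory; the function name sync_file_single_read is hypothetical and this is not what the tool currently does:

```python
import os


def sync_file_single_read(src, dst):
    """Overwrite dst with src only if their bytes differ.

    Hypothetical sketch: src is read exactly once. Returns True if dst
    was written, False if it was left untouched.
    """
    with open(src, "rb") as f:
        data = f.read()  # the only read of the (possibly remote) source
    if os.path.isfile(dst) and os.path.getsize(dst) == len(data):
        with open(dst, "rb") as f:
            if f.read() == data:
                return False  # identical content: keep dst and its timestamp
    with open(dst, "wb") as f:
        f.write(data)  # reuse the buffered bytes instead of re-reading src
    return True
```

The trade-off is memory: very large files would need a different approach, such as hashing while streaming.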