Rsync on Steroids

Rsync is an incredibly
powerful tool that synchronises anything from a single file to an entire
hierarchical filesystem, over a network. Unlike many other
synchronisation methods, rsync will use the outdated copy of a file to
save on network traffic, cutting the data transferred by anything up to
99% in favourable cases.

The rsync implementation,
however, is restricted to POSIX systems (such as Linux, Cygwin and
*BSD), and, worse, it can only perform operations on
POSIX-based filesystems. This seems somewhat puzzling, and, as part
of the continued Tech Fusion series, this
article will outline some of the amazingly powerful things that could be
done with rsync... if it had a VFS layer.

Rsync (the application) performs directory-by-directory and file-by-file
synchronisation of a filesystem hierarchy - a POSIX-compliant
filesystem hierarchy. Recent modifications to rsync already show some
of the limitations of the current approach: storage of userid
information in extended attributes, for when rsync is running as a
daemon, has only just been added! The reason is that an rsync daemon
cannot be run as root, and so, when attempting to synchronise file
permissions and userid attributes (thus maintaining filesystem
integrity when performing backups), the previous version of rsync
simply threw that information away. As a workaround, the information is
now stored in "extended attributes" - if ext3 or another filesystem
that supports them is used - for later retrieval on a restore /
recovery.
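The idea can be sketched as follows. This is not rsync's actual on-disk format (its --fake-super option defines its own encoding); the function names and the attribute name "user.rsync.stat" are illustrative assumptions:

```python
# Sketch: pack ownership and mode into a string suitable for storage in
# a "user." extended attribute, for later retrieval on restore.

def encode_stat(mode: int, uid: int, gid: int) -> bytes:
    """Pack mode/uid/gid into an xattr-friendly value."""
    return f"{mode:o} {uid}:{gid}".encode()

def decode_stat(value: bytes) -> tuple[int, int, int]:
    """Recover mode/uid/gid from the stored value."""
    mode_s, ids = value.decode().split(" ")
    uid_s, gid_s = ids.split(":")
    return int(mode_s, 8), int(uid_s), int(gid_s)

# On a real POSIX filesystem the value would be attached with
#   os.setxattr(path, "user.rsync.stat", encode_stat(mode, uid, gid))
# and read back with os.getxattr() at restore time.
```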

How much better would it be if rsync had a VFS plugin layer, so that
userid information and other attributes could be put
into an alternative database, of which "storage in extended attributes"
was just one example? Wouldn't it be nice to be able to store that
information in a format compatible with backuppc?
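A minimal sketch of what such a plugin layer might look like. The names AttributeStore and DictStore are hypothetical, not part of rsync: the point is that each backend - xattrs, a database, a backuppc-compatible store - implements the same small interface:

```python
# Hypothetical attribute-store plugin interface: each backend decides
# where uid/gid/permission metadata lives.
from abc import ABC, abstractmethod

class AttributeStore(ABC):
    @abstractmethod
    def save(self, path: str, uid: int, gid: int, mode: int) -> None: ...

    @abstractmethod
    def load(self, path: str) -> tuple[int, int, int]: ...

class DictStore(AttributeStore):
    """In-memory backend, standing in for xattrs, SQL, or backuppc."""
    def __init__(self):
        self._db = {}

    def save(self, path, uid, gid, mode):
        self._db[path] = (uid, gid, mode)

    def load(self, path):
        return self._db[path]

store = DictStore()
store.save("/etc/passwd", 0, 0, 0o644)
```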

Or - how about storing an entire filesystem in a tarball? TAR (Tape
ARchive) files have supported userid attributes, last-modified dates and
permissions for decades. Heck, while we're at it - what else counts as
a "hierarchical storage" mechanism in the I.T. world? NTFS and HPFS;
XML files and HTML files; Structured
Storage and Streams; GVFS and KDE's KIO VFS plugins; FUSE and other user-space file
systems; heck, even wget
could be back-ended
into an rsync plugin at one end: in combination with a TAR plugin at
the other end you could make regular compressed backups of web sites
(ok - smart readers will have noticed that the last is stretching
things a bit, but wait - there is rproxy! oh darn. hmm... even
smarter readers will have noted that U.S. patents are only valid
in the U.S., but frequently any patent results in a
piece of software development being stopped, dead. we neeeed to
do something about this, even if it means putting a notice on rproxy
that it must not be distributed in binary form to the United States,
until Software Patents are
neutralised. but anyway - sorry for the interruption!).
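A tarball really does carry all the metadata discussed above. This sketch, using Python's standard tarfile module (the filename and values are made up for illustration), writes one file into an in-memory tar archive with explicit uid, mode and mtime, then reads them back:

```python
# Write a file into an in-memory gzipped tar archive with explicit
# ownership, permissions and timestamp, then recover that metadata.
import io
import tarfile

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    data = b"hello"
    info = tarfile.TarInfo(name="site/index.html")
    info.size = len(data)
    info.uid, info.gid = 1000, 100
    info.mode = 0o644
    info.mtime = 1199145600  # 2008-01-01 00:00:00 UTC
    tar.addfile(info, io.BytesIO(data))

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    member = tar.getmember("site/index.html")
    print(member.uid, oct(member.mode), member.mtime)
```

A TAR backend for an rsync VFS layer would simply map these TarInfo fields to the same attributes a POSIX filesystem backend reads from stat().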

What else is "hierarchical"? IMAP (and, to some extent, POP3)
mailstores. How about actually going into the mail messages
themselves, unpacking attachments, then looking across the entire
mailbox for similar attachments, and performing a pseudo-sync of the
"old" version of the attachment and the "new" one? How about doing the
same thing across filesystems themselves?

How about optimising rsync on the server by storing the
(expensive-to-calculate) MD4 block checksums in a database? One of the
reasons why rsync is not that widely deployed (Debian mirror sites often
do not run rsync) is the amount of checksumming that's
carried out each time a file is synchronised. However, if you can
guarantee filesystem integrity because the entire filesystem is stored
not in a POSIX-compliant filesystem but actually in a SQL database,
along with the MD4 checksums - actually splitting the files up into
"blocks" rather than storing each file as one contiguous binary
blob - then you've immediately got not only a method for
optimising file storage space (if blocks occur more than once across
many files, or even within the same file) but you've also saved yourself
a great deal of CPU time by not having to recalculate the MD4 checksums.
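The block-level storage idea can be sketched in a few lines. rsync historically used MD4 for block checksums; MD4 is often unavailable in modern hashlib builds, so MD5 stands in here, and the tiny block size is purely for demonstration:

```python
# Content-addressed block store: files are split into fixed-size blocks
# and each distinct block is stored exactly once, keyed by its checksum.
import hashlib

BLOCK = 4  # tiny block size, for demonstration only

def store_file(data: bytes, blockstore: dict) -> list[str]:
    """Store data as checksum-indexed blocks; return the block list."""
    keys = []
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        key = hashlib.md5(block).hexdigest()
        blockstore.setdefault(key, block)  # duplicate blocks stored once
        keys.append(key)
    return keys

def read_file(keys: list[str], blockstore: dict) -> bytes:
    """Reassemble a file from its ordered block-checksum list."""
    return b"".join(blockstore[k] for k in keys)

blockstore = {}
keys = store_file(b"abcdabcdabcdxyz!", blockstore)
print(len(keys), "blocks referenced,", len(blockstore), "blocks stored")
```

Because each file is just an ordered list of checksums, the server-side checksums the rsync algorithm needs are already sitting in the database - no per-transfer recalculation required.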

How about storing a hierarchical file system in GIT? (yes -
i noticed that GIT itself can use rsync for synchronisation - but
I'm talking about rsync using GIT for file storage!).

The list of possibilities is just incredible.

My favourite has to be an IMAP plugin, though, because then finally you
can keep as many "offline" copies of your mailbox as you want
synchronised with the "online" copy. This is one of the things that
Exchange has for which there is no equivalent offering from Free
Software projects (that i know of) - and even in Exchange,
synchronisation is a dog, causing immense aggravation to users. An
rsync IMAP plugin would allow users to install an IMAP daemon on their
own local system - a desktop or even a PDA - which then automatically
synchronised email in a highly efficient manner.

Likewise, even the sending of email - rsync with an SMTP plugin - could
perform "synchronisation" over to a server before sending it out over
the Internet. Close integration between the IMAP plugin and the SMTP
plugin could result in massive savings of network traffic - very
handy on GSM/GPRS connections - based on analysis performed by
the plugins: looking at file attachments that had already been
transferred, or had been modified only by tiny amounts, and transferring
only the differences rather than the whole email message.

(whilst we're at it - this of course hints at the possibility of
doing away with SMTP altogether, especially with a peer-to-peer
distributed IMAP server. Think about this: when you "send" an email,
where is a copy first stored? in your IMAP "sent mail" folder. So why
send it via SMTP at all? why not drop a DHT-based "notification"
message into the peer-to-peer infrastructure for your recipient to pick
up (with the hash of their email address as the 'key', of course),
providing sufficient information and privileges such that they can
"authenticate" against your IMAP server or its online version,
and access your "sent mail" folder directly. using rsync-IMAP of
course :) wouldn't have it any other way. The advantage of this
approach is that the problem of SPAM almost entirely disappears,
as you are using an authenticated "pull" mechanism, not the
"push" mechanism that is SMTP. Further enhancements are to have
a hash of both the sender's and the recipient's email addresses
concatenated; for the recipient to perform regular "polling"
of all known senders; and for a new recipient to "request
authorisation" to send email, just as has been done in every
single popular IM
system ever invented. Actually, an even better enhancement would
be to negotiate a random hash for use by each sender-recipient
combination, with the hashes generated at "communication acceptance"
time, aka "buddy authorisation" - yukk, hate that phrase.)
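The per-pair notification key above can be sketched very simply. SHA-256 is an assumption here (the original suggests hashing email addresses without naming an algorithm): the key is derived from the sender and recipient addresses concatenated, so only a party who knows both addresses can compute where to poll:

```python
# Derive a deterministic DHT key from a sender-recipient pair, so both
# sides can independently compute the same rendezvous point.
import hashlib

def notification_key(sender: str, recipient: str) -> str:
    """Hash of sender and recipient concatenated (order matters)."""
    return hashlib.sha256(f"{sender}\0{recipient}".encode()).hexdigest()

# Sender and recipient both derive the same key:
k1 = notification_key("alice@example.org", "bob@example.org")
k2 = notification_key("alice@example.org", "bob@example.org")
```

The further refinement in the text - a random per-pair hash negotiated at authorisation time - would simply replace this derived key with a stored secret, removing even the ability of third parties who know both addresses to compute the key.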

My second favourite idea is the one where XML documents are treated as
"filesystems", which doesn't sound like such a big deal until you
recall that ODF
is an XML standard. Thus, the possibility exists to use rsync with a
double-VFS-plugin (on input as well as output) to perform real-time
peer-to-peer document editing (just like writely.com, aka "Google
Docs"). Whilst I realise that it is a non-trivial task to make any
editor (whether it be Inkscape
or KOffice or any other) report and
recognise XML fragments as "modified" and "synchronised", at least a
convenient and efficient method would exist to perform the
document synchronisation, alleviating the need for the developers of
each of the editor projects to reinvent that wheel.
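A rough sketch of treating an XML document as a "filesystem" of fragments: give each element subtree a checksum, so a synchroniser can ship only the fragments whose checksums differ rather than the whole document. The path scheme and use of MD5 are illustrative assumptions:

```python
# Checksum every element subtree of an XML document, so two versions can
# be compared fragment-by-fragment instead of byte-by-byte.
import hashlib
import xml.etree.ElementTree as ET

def fragment_digests(xml_text: str) -> dict[str, str]:
    """Map an element 'path' to a digest of its serialised subtree."""
    root = ET.fromstring(xml_text)
    digests = {}
    def walk(elem, path):
        for i, child in enumerate(elem):
            walk(child, f"{path}/{child.tag}[{i}]")
        digests[path] = hashlib.md5(ET.tostring(elem)).hexdigest()
    walk(root, root.tag)
    return digests

old = fragment_digests("<doc><p>one</p><p>two</p></doc>")
new = fragment_digests("<doc><p>one</p><p>TWO</p></doc>")
changed = [p for p in new if new[p] != old.get(p)]
print(changed)
```

Only the second paragraph (and its ancestors, whose digests cover it) shows up as changed, which is exactly the granularity a peer-to-peer document-editing synchroniser would want to transfer.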

I just know that there are more things that could be done, such
as making the file-selection method part of the plugin architecture
(options such as --exclude, --include, --cvs-exclude and
--one-file-system): instead of these fixed options, each plugin would
supply a set much more suited to it. There must be far more uses
for rsync, with a VFS plugin layer, than I've been able to describe and
hint at, here.

remember: all patent law states that an "inventor" has the right to create a single instance of a patent for "personal use" such that they can experiment and create "new inventions". that right is enshrined into patent law, world-wide.

it just so happens that "downloading and compiling software" dove-tails nicely with this :)

so, if there's a problem with a single component being patented, heck - make it software-only distribution and provide a compile-up option on the user's device ha ha.

MD4 checksums are used as keys into a DHT p2p store: backups would be not only distributed but also "merged", since only one copy of each block need be stored (indexed by MD4 checksum). idea contributed by phil - thanks!


OfflineIMAP is a tool to simplify your e-mail reading. With OfflineIMAP, you can read the same mailbox from multiple computers. You get a current copy of your messages on each computer, and changes you make in one place will be visible on all other systems. For instance, you can delete a message on your home computer, and it will appear deleted on your work computer as well. OfflineIMAP is also useful if you want to use a mail reader that does not have IMAP support, has poor IMAP support, or does not provide disconnected operation.

how about an rsync plugin that does an "in-memory" synchronisation of a hierarchical data structure? :)

for example, in KOffice, OpenOffice or Inkscape or another editor, you would do a memory-to-memory synchronisation of the actual in-memory data structure that the word processor or editor is using (!)

it is of course essential to have "locking" of memory areas as part of the rsync VFS layer - although, to be honest, i am not sure whether rsync even _has_ "file locking" that would map onto "data structure locking"; i've not checked.

lkcl, i think it is quite easy to prove that the performance will *always* be around a factor of 2 slower than just directly copying, because the algorithm requires a full read of both copies to determine the delta in the first place. So that pretty much blows away any benefit over just copying.

Now, if

1) you have memory that has very asymmetric read/write timings in favour of reading, that might still be useful (I don't know of any memory technology that has that property)
2) you actually wish to reduce not memory manipulation but memory usage, by storing deltas only (not full copies) and constructing the 'revisions' on the fly, this may be worth something

At 2) I believe it is common in the kind of application you mentioned that there is already a Command pattern in place to cater for undo. This Command history (tree?!) can be used as a de-facto delta-tree to reduce memory consumption. I suppose this is exactly what happens (the Command-delta-tree could be viewed as a transaction journal against the 'committed' copy of your in-memory data structure).
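The Command-history-as-delta-journal idea can be sketched concretely. The (op, args) tuple format here is an assumption, standing in for whatever Command objects an editor's undo stack actually holds: revisions are reconstructed by replaying deltas against one committed copy instead of storing full snapshots:

```python
# Replay a journal of editing commands against a committed base copy,
# reconstructing any revision without keeping full snapshots of each.

def apply_commands(base: str, commands: list) -> str:
    """Apply (op, args) deltas in order and return the resulting text."""
    text = base
    for op, *args in commands:
        if op == "insert":
            pos, s = args
            text = text[:pos] + s + text[pos:]
        elif op == "delete":
            pos, n = args
            text = text[:pos] + text[pos + n:]
    return text

committed = "hello world"
journal = [("delete", 0, 5), ("insert", 0, "goodbye")]
print(apply_commands(committed, journal))
```

Replaying a prefix of the journal yields any intermediate revision, which is exactly the transaction-journal view of the Command tree described above.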

It occurs to me that Google will send some people to this article who will find lkcl's ideas
fantastic, but who were actually looking for advice on how to get rsync to do what they want at all.

Andrej Bauer has written just such a guide, Remote Backup with Secure Shell and
Rsync. It does not explain rsync's huge array of shell options; instead it introduces a
sh script written by John Langford, famous for his work on applying dimensional analysis to
data mining, and describes good practice for using it. Once you've got your own rsync setup
working, you'll be able to appreciate the discussion here of VFS layers for rsync and rsync-alikes
at so much deeper a level...
