Incremental Sync in ownCloud

Incremental Sync is probably the feature that most people ask for, or sometimes even cry for. Recently there was another wave of discussion about whether ownCloud is doing incremental sync or not.

I will try again (as in this issue) to explain why we decided to slow down work on that feature. Slowing down means that it will be done later, not never, as it was stated. It is just that we think that other things benefit the whole idea of ownCloud more. That has plain technical reasons. Let’s dive a bit into them.

RSync is great

Nobody will object here. In a nutshell, this is how rsync works: There is a file on the client and on the server. The idea is to not transfer the entire file from one side to the other if either side changes, but only the parts that have changed.

The original rsync does this by chopping the file into blocks of a given size and calculating a checksum for each block. The list of checksums is sent to the server and – here’s the trick – the server looks at its version of the file and, for each checksum in the list, searches for a block with the same checksum in its copy. That block will often not be at the same position in the file, but somewhere else. This is done for each block, and finally the server works out which parts of the file it already has and which are missing and have to be sent by the client.
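In rough Python, the scheme just described looks like this. A minimal, illustrative sketch only: real rsync uses a cheap rolling checksum plus a stronger hash so the byte-by-byte sliding search stays fast, and much larger blocks than the toy 4-byte size here.

```python
import hashlib

BLOCK = 4  # tiny block size so the demo is readable; rsync uses hundreds of bytes

def block_signatures(data: bytes) -> dict:
    """Checksum every fixed-size block of the receiver's copy of the file."""
    return {hashlib.md5(data[i:i + BLOCK]).hexdigest(): i
            for i in range(0, len(data) - BLOCK + 1, BLOCK)}

def delta(new: bytes, signatures: dict) -> list:
    """Scan the sender's file: emit ("copy", offset) where a block is
    already present on the other side, literal bytes where it is not."""
    ops, literal, i = [], b"", 0
    while i <= len(new) - BLOCK:
        h = hashlib.md5(new[i:i + BLOCK]).hexdigest()
        if h in signatures:              # receiver already has this block
            if literal:
                ops.append(("literal", literal))
                literal = b""
            ops.append(("copy", signatures[h]))
            i += BLOCK
        else:                            # no match at this offset: slide one byte on
            literal += new[i:i + 1]
            i += 1
    ops.append(("literal", literal + new[i:]))
    return ops

old = b"the quick brown fox jumps"
new = b"the quick brown cat jumps"
ops = delta(new, block_signatures(old))
# mostly ("copy", ...) references plus one short literal covering the change
print(ops)
```

The receiver sends the signatures of its copy; the sender answers with a delta of `copy` references and `literal` bytes, from which the receiver reassembles the new file without ever seeing the unchanged content on the wire.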

Thanks to this clever algorithm, we only have to transmit a very small fraction of the changed file, because most of its content did not change. And that is what we want! Yeah!

Mission accomplished? No, not really. While there is basically nothing wrong with the idea in general, there is a severe architectural downside. The rsync algorithm depends on a strong server component which, for each file, searches around and calculates checksums. In an environment where we potentially have a lot of clients connecting to one server, that would create a huge load which we need to avoid. So what if, instead of putting the burden on the server’s shoulders, we could make the clients take the responsibility?

And guess what, somebody has thought about that before, and he says:

Use ZSync for this!

ZSync basically turns the idea of rsync upside down and shifts the calculation of checksums away from the server and onto the clients. That means that with zsync, the server can keep a static list of checksums for every block, specific to a version of a file. The list can, for example, be computed during the upload of the file to the server. From that point on it does not change, as long as the file does not change. That means less computation work for the server, and maybe this job can even be moved into the client.
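A sketch of what that could look like on the server side. Everything here is an assumption for illustration – the `.sig` side-car file name and its JSON layout are invented, and the real zsync metadata format is richer (it pairs a rolling checksum with a strong one per block). The point is only that the signature list is computed once, at upload time, and then served as static content:

```python
import hashlib
import json
import pathlib
import tempfile

def store_with_signatures(path: pathlib.Path, data: bytes, block: int = 4096) -> None:
    """Write the file plus a one-time, per-block signature list beside it.

    The ".sig" side-car and its JSON layout are invented for this sketch;
    zsync's real metadata format is more elaborate.
    """
    path.write_bytes(data)
    sigs = [hashlib.sha1(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)]
    path.with_name(path.name + ".sig").write_text(
        json.dumps({"block": block, "sigs": sigs}))

def load_signatures(path: pathlib.Path) -> dict:
    """The static list a client would fetch to do all the matching itself."""
    return json.loads(path.with_name(path.name + ".sig").read_text())

# compute the signatures exactly once, at upload time
folder = pathlib.Path(tempfile.mkdtemp())
store_with_signatures(folder / "report.bin", b"x" * 10000)
meta = load_signatures(folder / "report.bin")
print(len(meta["sigs"]), "blocks of", meta["block"], "bytes")
```

From then on, a sync request costs the server nothing but serving two static files; all checksum comparison happens on the client.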

So far that sounds cool (even though some questions remain) and like something that can help us.

Unfortunately, the approach does not work very well for compressed files. The reason is that if a file gets compressed, even if only a couple of bytes in the original file change, the compression algorithm usually changes a lot all over the entire file. As a result, the zsync algorithm can only compute a comparably large diff. Given the cost of computation that can turn inefficient quickly.
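The ripple effect is easy to observe with Python’s zlib, which uses the same DEFLATE scheme as ZIP and gzip. A quick, illustrative experiment: a three-byte edit near the start of the plain file typically perturbs a large share of the compressed stream, leaving a block-based diff almost nothing to match.

```python
import zlib

plain = b"the quick brown fox jumps over the lazy dog. " * 200
edited = plain.replace(b"lazy dog", b"lazy cat", 1)   # one tiny edit near the start

a, b = zlib.compress(plain), zlib.compress(edited)

# compare how much actually changed, before and after compression
plain_diff = sum(x != y for x, y in zip(plain, edited))
compressed_diff = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
print(f"bytes changed in the plain file: {plain_diff}")
print(f"bytes changed in the compressed file: {compressed_diff} of {len(a)}")
```

Once the two DEFLATE streams diverge, the bit alignment of everything after the edit shifts, so the unchanged tail of the document no longer produces matching blocks either.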

“But who uses compressed files?” you might argue. The problem is that many of the files of everyday life are stored compressed. This is for example true for Microsoft Office files and the Open Document files produced by LibreOffice and Apache OpenOffice. They are really renamed ZIP containers that hold the document with all its embedded files, etc.

Now of course you will reply that zsync has an improved algorithm for compressed files. Yes, true, that is a great thing. However, it requires that the compressed file be uncompressed before zsync can work on it. Afterwards it is compressed again. And that is the problem: as common compressors do not leave a hint behind about _how_ the file was compressed, it is not possible to reliably recreate a file that is equivalent to the original one. How will apps react to a file that has changed its compression scheme?

Results

As said above, yes, we will at some point implement something along the lines of the zsync algorithm. The explanations above should show, however, that at the current state of ownCloud, other features will improve ownCloud’s performance, stability and convenience more. And that is the important thing for us, more than pleasing the loudest barking dogs.

Here is a rough outline of how I would move on with this, open for your suggestions and critique:

The zsync algorithm is designed to improve downloads. We need it for both up- and downloads, and it needs to be thought through whether that is also possible. For the server-side functionality, there are a couple of open questions which have to be investigated carefully. Preferably, an app can be written that provides the handling of the zsync checksum lists. That has to be clarified and discussed, and it will take a while.

But as outlined above, this idea is only clever for a limited set of file types. So what I would suggest first is that we get an idea of the file types users usually store in their ownCloud, so that we can make a validated estimate of how much this feature would help. I will follow up on this first step.

I probably do not know enough about the ownCloud implementation – in fact I know almost nothing; I came across this blog post on Planet KDE.

You put forward some good discussion on why incremental file transfer isn’t a priority for some specific (common?) use cases. This does not mean it is never worthwhile, and it is possible to conceive of cases where handing off the sync transfer of changed files to rsync would be ideal (syncing the encyclopedia you are editing in a monolithic text file over 14.4k dial-up perhaps?).

Would it be possible to make the transfer protocol pluggable? E.g. when sync of file content is required and a condition is met, pass off the sync to an external program in a temp location. ownCloud picks up the (local) synced tempfile if the external process exits successfully, tests that it matches, and completes the sync. I would expect a performance hit, particularly server-side if enabled with a complex setup and many users. On the other hand, if optional, it would be near zero cost and it would provide a neat testing platform for people with such oddball use cases to test the benefit of various transfer protocols.

I admit it’s not a priority for me, and I don’t _expect_ the feature to turn up anytime soon. Setting up the server to run arbitrary transfer utilities to magic up a local updated copy of a file probably has security repercussions I haven’t considered, too… That said, the idea is to cheaply provide a way for people to experiment with the network transfer if it is important to them.

The question is: for which files is this delta sync relevant? Because it does not work very well on compressed or encrypted files (changing one bit changes everything) and is useless on small files (it’s faster to just sync them). So, is it useful?

* it isn’t for most music (compressed)
* it isn’t for most pictures (compressed)
* it isn’t for most movies (compressed)
* it isn’t for most documents (too small and/or compressed)

Ask yourself: what files are 1. bigger than 20 MB and 2. not compressed or encrypted?

There are two use cases that I can think of:

* biology researchers. They have massive text files with ATCGACTTGC in them. Not sure if they change them, though…
* VM images. Most of those could benefit from incremental sync.

Now if you have multiple multi-GB VM images you’d like to have synced across devices, you would benefit. And if you’re a biology researcher who changes big text files with genetic data. But that is, what, 3% of our user base?

I am not saying it can’t work better than sending the full file. You can start to special-case stuff. You can decompress files and look at the content, see what changed. If you can predict what the compression algorithm does, you can know how to chunk up the file and what to send over. But you need to know how to deal with every file type under the sun – or at least those often used. You can’t use the same generic algorithm on all of them – it won’t work.

For Dropbox and friends, this isn’t a huge problem:
* They can special case all kinds of data because they just throw 100 people at the problem.
* They have massive amounts of storage and a business reason to use it super efficiently. So they chunk everything up, and every piece of a file that matches does not have to be uploaded. With 250 million users, you’ve got plenty of overlapping data and thus you can save a lot of syncing. That doesn’t work for a private ownCloud server, unfortunately…
* They have one client, no need for supporting generic WebDAV or other clients.
* They control the storage protocol – ownCloud usually doesn’t! ownCloud supports external storage – and as you can’t expect your FTP server to deal with chunks, making this delta syncing work with external storage is both very hard and probably not half as beneficial as you’d hope.

Other open source file sync and share projects suffer from the same issues (except the last one; ownCloud’s external storage is, afaik, unique). Some have delta-sync, but I would think it’s more of a checkbox on a feature list than something that really helps with a lot of files…

So, we could work on making this work. It can probably save bandwidth on some files, paying a price in much increased server load and client load. We could work on this feature.

But we have many, MANY users who benefit from features like favoriting, selectively syncing files, sharing files with a right click in a file manager, showing the sharing status in the file manager, and the many other features the client has already introduced or that the client team is working on.

So, help us decide: what feature should we axe because delta sync, benefitting a very small number of users, is more important?

Personally, I would prefer being able to roll back changes to files with a right click in the file manager over delta sync. And I would prefer being able to drag a file onto the ownCloud systray icon and get a share link in my clipboard, ready to share, over delta sync. And those features are probably not even on the roadmap yet, and I would still push back delta sync for them.

Would you?

jhein

February 11, 2015 at 00:01

Hi, maybe we are talking about different things or I understood something wrong. I was talking about block-level sync. Before I moved away from Dropbox, I had TrueCrypt containers in my Dropbox to provide an extra layer of security. Or actually, a layer of security 🙂 One of them was 20 GB. However, when I changed anything within the container and unmounted it, Dropbox only uploaded the parts of that container that were changed, not the whole container. It was usually finished after a few seconds. Uploading the whole thing again would have taken much longer.

Hi jhein, I am curious to know what type of containers you are using now – still TrueCrypt (after it was discontinued)? I am a TrueCrypt user as well. I have also tested the block-level update of Dropbox. I tested hubiC & OneDrive too; it is not supported there. I have not tested other services yet.

Presumably you could take advantage of knowledge of the container format itself. Instead of computing hashes over the chunks of the container, compute chunks of the components (i.e. files like stylesheets) which are contained within it. This way, you can detect which components have changed within the container (more often the markup than the stylesheets), allowing you to extract & send just that component and reassemble an up-to-date version on the other side. The chunking bit might still be useful for the component parts – think compressed SVGs, which are simply huge XMLs.
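A toy version of that component-level comparison. Every ZIP member already carries a CRC-32 in the archive’s directory, so changed components can be spotted without decompressing anything; the two-member “document” below is just a stand-in for a real ODF/OOXML package, which has many more parts:

```python
import io
import zipfile

def changed_members(old_zip: bytes, new_zip: bytes) -> list:
    """Compare the per-member CRCs stored in two ZIP containers and
    return the names of members whose content differs."""
    def crcs(blob):
        with zipfile.ZipFile(io.BytesIO(blob)) as z:
            return {info.filename: info.CRC for info in z.infolist()}
    old, new = crcs(old_zip), crcs(new_zip)
    return [name for name, crc in new.items() if old.get(name) != crc]

def make(doc_xml: bytes) -> bytes:
    """Build a toy two-member container, loosely mimicking an ODF/OOXML file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("content.xml", doc_xml)
        z.writestr("styles.xml", b"<styles/>")
    return buf.getvalue()

v1 = make(b"<doc>hello world</doc>")
v2 = make(b"<doc>hello there</doc>")
print(changed_members(v1, v2))  # → ['content.xml']
```

Only the changed member would then need to be extracted and shipped, with the unchanged members reused on the receiving side.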

Obviously this doesn’t apply to lossy file formats. On the other hand, in the case of movies you could use knowledge of the container file format again to determine at least *what* stream changed, if any, and possibly roughly at what point… which might well present a huge performance win (e.g. the addition or removal of an embedded subtitle in an MKV, leaving the sound & video unaltered).

I am a biology researcher – great to hear that we are 3% of the ownCloud user base ;) – and I don’t have files of several GB with sequences, more like MB, and as you said I rarely modify them; I will instead create new files from them (often smaller). Also, the biggest files I tend to have and modify a lot are presentation files with movies and lots of pictures in them, and as you said they will not benefit from delta sync… So yes, please bring other great features first! I would love to have the rollback option in the file manager.

“So what I would suggest first is that we get an idea of the file types users usually store in their ownCloud”. What about creating an app that could report anonymized percentages of file type data, a little bit like the popcon package in Linux distributions? Of course this would be voluntary and only temporary. It may feel like a big privacy intrusion to some, but I would agree to share such data as long as it is anonymous and the file contents/names are not revealed.

I fired up my ownCloud server (which is half testing and half backup machine) and was dismayed to discover my desktop client struggling to re-transfer my whole photo library over my flaky wifi. This reminded me why I hadn’t used ownCloud much, as a couple of deleted dotfiles and a bit of a shuffle of folder structure had started re-transfer of a lot of unchanged data.

However, an upgrade or two (client & server) later, and (trying again) I was delighted to find ownCloud realised it already had identical files for most things and moved on to transferring the last few months of new content.

I now wonder how many ‘incremental diff’ requests were actually just people wanting the client to have the smarts to compare and skip transfer of files that had an updated timestamp or that ownCloud was rediscovering after the client was reconfigured… It certainly is a huge improvement!

Nice post, dragotin!
In order to answer the very last point of your post, it would be nice to have some sort of telemetry in ownCloud. Basically the same thing they have in Firefox: they collect and measure non-personal information and send it to a team that does analysis on it. Of course, the measurements taken would be visible to the admin of the ownCloud instance, and there would be a big opt-out check-box in the admin settings. Just an idea; maybe you already do something similar in the enterprise edition.

“what I would suggest first is that we get an idea of the file types users usually store in their ownCloud, so that we can do a validated estimate on how this feature helps”

Just to give you an idea, I store lots and lots of virtual machine files on my Dropbox, ranging from 20GB to 80GB++ in size. I need those VMs on my Desktop PCs, and on a couple of laptops.

When I boot a VM and change a single byte, Dropbox syncs just that single block, out of thousands and thousands of blocks, and I almost immediately see those changes reflected on all the devices I have Dropbox installed on.

I don’t think having virtual machine files on a cloud storage is something of a very exotic scenario these days.

It goes to show just how out-of-touch the core ownCloud devs are with their community. Almost half of all of ownCloud’s Bountysource bounties are for this one issue, and now you can’t even comment on it on GitHub because it annoys one of the devs to hear from users who want this functionality. Then here is the discussion from core devs about whether or not we “really need” this feature. You can pretty much be straightforward and say “this will never happen” instead of “this may happen sometime”, because even if the technical wherewithal exists inside the core dev clique (perhaps the central problem), it most certainly doesn’t exist in their minds as a priority. The project is just an enormous, behemoth LAMP application anyway. There are better solutions out there.

I wonder what the purpose of this post really is, but let’s assume it is about your thoughts on the core devs’ prioritization of this feature. The answer is “Yes, it will happen some day”, if that was unclear. I tried to explain that my personal belief is that there are more important things for the project to do first.

If that means for you that the core dev clique does not see a priority in this, I can only invite you to join the group and find out. We’re open, especially to pull requests.

I agree on the point that before working on that feature, we need to know whether it will be useful to 40% or 2% of the current user base of ownCloud.

But there is also the notion of where ownCloud should be headed, what users they want to target next: are OC devs content with the current user base, or do they want to expand the use of ownCloud to a broader audience?
Why is this critical? Because I think that while TrueCrypt or VM containers are indeed common amongst a somewhat “geeky” user base (which I’m a part of, and which IMHO makes up most of the current user base), they are BEYOND RARE for your average Internet user.
It is easy to forget that these tools need technical knowledge that 99% of the population that uses the Internet does not possess. Those users will have photos, music, videos, office documents, and little more; all file types that were established to gain nothing from delta-sync.

So really, it’s all about what are the plans for ownCloud in the future.

My personal opinion is that ownCloud SHOULD strive for broader adoption (not just in number but also in terms of the kind of user it gets), so the current plans to delay delta-sync are fine with me, because it is best to focus on features that will appeal to a wider audience.

Also, just to clear up any misunderstanding: I’m not saying that technically savvy users should be dismissed as valuable users.
My point is more that we need to stop producing “geek software for the geeks”; software that protects privacy and freedom online is especially prone to this. This leaves the vast majority of Internet users alone in a very hostile ecosystem for their privacy. Everyone deserves a product like ownCloud, which better protects online privacy, and so the focus should be on attracting those users.

The advanced users can manage on their own anyway, as they have the technical knowledge to work around any product limitation.
For example: ownCloud doesn’t do delta-sync? Fine, sync your TrueCrypt files – and only them – to Dropbox (or another provider that does delta-sync). Your regular files can go to ownCloud. This is a minor adjustment to the setup.

Just like any driver can expect a car to be safe to drive, not just professional mechanics, all users of online services should have access to a safe ecosystem, not just the people who have invested hours and hours in learning how computers work.

For years I have been using Novell iFolder to sync all my documents (~40 GB) between several computers. iFolder uses delta sync and it saves me a lot of bandwidth. Every time I change something within a PowerPoint presentation I can see in the sync log “saved ~80-90% of sync size”. iFolder is open source, and a programmer could look into the code to see how it works. Sadly, Novell has stopped updating iFolder, and therefore I’m looking for an alternative solution. ownCloud would fit my needs if it offered delta sync.

Wow, after years of not using ownCloud, I was wondering if this had finally been implemented, so Google brought me here. Kind of shocking that it still isn’t! That was a deal breaker for me at the time, and it still is. I want to handle large TrueCrypt (well, today it’s VeraCrypt) containers without uploading the whole file each time when just some blocks inside it are changed. As mentioned by many others, Dropbox handles these kinds of files just fine, and has done so IIRC from the beginning.

I do think such encrypted containers are VERY important for online storage services, since there is little to no control over what happens with your data on the other end.

https://github.com/owncloud/core/issues/18716 is going in before 9.0, the first step towards incremental sync. Once the client can compute and check checksums, we can look into a rolling checksum. It won’t be done next week, as it continues to be a minority use case and, more importantly, it simply needs somebody who wants to do the work (or somebody to pay for it), and nobody has stepped up yet for either.

Well, actually my clients and I find this an absolute necessity; surely if Dropbox is ownCloud’s competition/alternative, the features need to match if not surpass it. Lots of clients are using POP3 email accounts, for example, as IMAP accounts take up a load of server space. This means .PST files are huge for a lot of clients, 4 GB and up, and email is an ESSENTIAL part of anyone’s backup. Hence ownCloud cannot effectively back up Outlook PST files.

No offense, but if you use a sync tool for backup you might not understand what backup is… ownCloud synchronizes changes; that means that if you lose your files locally, ownCloud will delete them from the server, too. I suggest using a backup tool for making backups.

If you have problems with IMAP taking up a lot of server space, and replace it with PST files on ownCloud or Dropbox taking an equally big amount of space on a different server, perhaps you should think about using an email server which deals well with large amounts of mail. Then users don’t need to set up and sync a PST file and move emails into it, and you can easily take care of backups.

Alternatively, I know there are people who don’t want to move away from Exchange but, like you, have trouble with the fact that it is a poorly scaling and exceedingly expensive solution when you have lots of users and data. A way around that is to replace attachments with ownCloud storage directly on the Exchange server. If you are interested in that solution, contact ownCloud sales.

Thank you for your reply, Jos. Perhaps you don’t understand, and maybe I have not explained the problems in the market clearly. We (the market) are indeed looking for a sync solution, not a backup solution. Simply put, if we use a large file – e.g. .zip folders or PST files – the second any part of it is changed, ownCloud starts all over from the beginning to re-upload the file as if ALL of it had changed. Regardless of whether it is a PST file or any other, ownCloud does not sync only the part that was changed. I’m sure you would agree that Dropbox is a sync solution? And their marketing states this very feature: “Before transferring a file, we compare the new file to the previous version and only send the piece of the file that changed. This is called a “binary diff” and works on any file type. Dropbox compresses files (without any loss of data or quality) before transferring them as well. This way, you also never have to worry about Dropbox re-uploading a file or wasting bandwidth.”
I.e. when is ownCloud going to get up to scratch with market and competitor standards and requirements?

@RyanDM1983, have you actually read my post? Especially the first paragraph of the chapter “Results”? I am happy to call it back to your mind: yes, we can do delta sync; my post explains how it can be done in principle. But at the current state of ownCloud syncing, other changes will benefit the user experience more, as delta sync does not work well with compressed files. So far the bottom line of the blog post.

Just stay tuned, and always remember that in open source projects, work is much more helpful than good advice. So how about you do some investigation into how Dropbox does their “binary diff”? I would be so curious 😉

@dragotin, I appreciate all the work being done on ownCloud and forums such as this one. The workings of server apps such as ownCloud are not my forte; I buy and resell the ownCloud product. At best, what I’m reporting is the direct response coming from the market. The bottom line is ownCloud vs. Dropbox. Dropbox syncs in real time; ownCloud does not. In layman’s terms: if I spend 3 days uploading a file to ownCloud, and once it is fully synced I make a single change, ownCloud takes another 3 days and uses up all my data & bandwidth again to sync that change. Dropbox does not; it takes 2 minutes to upload the change! I wouldn’t, and neither would our target market, call this a slowed-down feature; it’s more of a basic necessity.

From my understanding, a binary diff compares two sets of digital data in order to determine whether they are 100% identical; binary compares are often used when programs create or modify data and need to make sure a newly generated file is the same as the old one. Doing this must be possible; Dropbox is doing it. I’ve tested and compared it. Thanks for all the work.

Having read the rationale behind not leaping to do differential sync, I’d say that this is in fact better handled by a proxy in front of the ownCloud box. In commercial land, this would be a Steelhead, Silverpeak, WAAZ or Bluecoat box. In open source land, I’m not sure there’s anything there yet, but it’s a niche that needs to be filled.

Essentially, leave the ownCloud code alone and focus on things relevant to that, and (in true Unix fashion) let the network bandwidth be compressed/de-duped by a platform dedicated to that.

Unfortunately, the inability to sync deltas is a deal-breaker for me as well. I need to use TrueCrypt containers, and re-transferring the entire file every time doesn’t cut it.
I understand there are challenges. I hope ownCloud will come out with this feature soon.

Jethro, your suggestion of letting a ‘proxy’ handle it doesn’t address every use case efficiently. Although I would like to understand how to do it with Steelhead or Bluecoat.

I don’t use ownCloud; I ended up here during a search for Google Drive delta syncing options. However, I think I can provide the devs with a less niche use case for delta sync. I am an architect, and the files that we create for our 3D building models, graphics, and renderings can be rather large (>100 MB), and we frequently change very small parts of these files (such as correcting spelling mistakes in the notes for our 3D building models). Without delta sync, these frequent small changes require the whole file to be uploaded and re-downloaded to whatever computer we are syncing with.

I frequently collaborate with colleagues and consultants in other states, and this delay is causing us issues. We used to use Dropbox for our file sharing/syncing, but switched to Google Drive as their service has much better file/folder access controls (Dropbox requires that lower-level folders inherit the sharing permissions of their parents; you can’t share a parent folder with A, B, and C and then a child folder with only A and C). However, Google Drive does not support delta or block-level syncing, and this is now becoming an issue for us, as the delay between one person making changes to a file and the other person receiving those changes on their computer has increased twenty-fold. With Dropbox, minor to medium changes to a large file would be synced in about 1-5 minutes. Now with Google Drive we are seeing 10-20 minute delays. We can’t go back to Dropbox as we need the file access controls, but we are struggling to deal with the frustration of the slow sync times required for Google Drive to sync the entire 150 MB file.

Now if ownCloud can offer both of these features – delta/block sync and strict access control on shared folders – I might consider switching to this service. That would be a killer feature set, as it combines the best of both Dropbox and Google Drive.