WEP 102 - Partial Sync Download

Introduction and Rationale

When a user sets up Weave on a new machine and syncs existing data from the server, there will likely be many records for engines to handle. Processing every record can take a long time and prevent the user from using the browser, yet arguably the user should be good to go with just a portion of all that data.

Users can also end up with many records to download in other situations, such as not running Weave for a while, or when another client uploads a large batch of records, perhaps by importing data.

This WEP describes how Weave would decide what subset of data to pull from the server and how to continue fetching the rest of the data later.

Proposal

Instead of requesting all data from the server in "depthindex" sort for tree structures, request only the first 100 records sorted by "interestingness":

old: GET .../collection?newer=1000&full=1&sort=depthindex
new: GET .../collection?newer=1000&full=1&sort=index&limit=100
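As a rough sketch, the new request could be built like this. The parameter names come from the examples above; the base URL and the helper function itself are assumptions for illustration, not existing Weave code.

```python
from urllib.parse import urlencode

def first_sync_url(base_url, newer, limit=100):
    """Request the `limit` most interesting full records modified after `newer`."""
    params = {"newer": newer, "full": 1, "sort": "index", "limit": limit}
    return "%s?%s" % (base_url, urlencode(params))

# e.g. first_sync_url("https://server/collection", 1000)
```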

As we incrementally parse the incoming records, keep track of the GUIDs that were processed as well as the server timestamp of the request.
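The bookkeeping described here might look like the following sketch. The record shape, the `apply_record` callback, and reading the server timestamp as a plain argument are all assumptions; real code would stream records from the response and read the timestamp from the response headers.

```python
def process_records(records, server_timestamp, apply_record):
    """Apply each incoming record and remember which GUIDs were handled,
    along with the server timestamp of this request (hypothetical shape)."""
    processed = set()
    for record in records:
        apply_record(record)
        processed.add(record["id"])
    return processed, server_timestamp
```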

If we processed 100 records, it's quite likely that there are more records the server didn't send because of the limit. So we then fetch the full list of GUIDs by sending a request similar to the one before, except without asking for full records or a limit, and with an "older" cutoff so that records uploaded since fetching the original 100 are ignored:

GET .../collection?newer=1000&sort=index&older=1234
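A sketch of building this GUID-list request, assuming the same hypothetical base URL as before and that `older` is the server timestamp remembered from the first request:

```python
from urllib.parse import urlencode

def guid_list_url(base_url, newer, older):
    """Fetch just GUIDs (no full=1, no limit) in interestingness order,
    capped at `older` so records uploaded since the first fetch are excluded."""
    params = {"newer": newer, "sort": "index", "older": older}
    return "%s?%s" % (base_url, urlencode(params))
```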

With this list of GUIDs, remove any that we've already processed in the original 100, and what remains is a list of outstanding GUIDs in order of interestingness.
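The filtering step is an order-preserving set difference; a minimal sketch, assuming the GUID list arrives already sorted by interestingness:

```python
def outstanding_guids(all_guids, processed):
    """Drop already-applied GUIDs while keeping the server's
    interestingness order for the rest."""
    processed = set(processed)
    return [guid for guid in all_guids if guid not in processed]
```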

If there are any GUIDs waiting to be fetched, we can then take the first 50 off this list, then request and apply those full records:

GET .../collection?full=1&sort=index&ids=guid0,guid1,...,guid49
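Taking a batch off the pending list and building the ids request could be sketched as follows; the function name, return shape, and batch size default are assumptions based on the numbers above.

```python
from urllib.parse import urlencode

def next_batch_url(base_url, pending, batch_size=50):
    """Split off the `batch_size` most interesting outstanding GUIDs and
    build the full-record fetch for them; returns (url, remaining GUIDs)."""
    batch, remaining = pending[:batch_size], pending[batch_size:]
    # safe="," keeps the ids parameter readable, matching the example request.
    params = {"full": 1, "sort": "index", "ids": ",".join(batch)}
    return "%s?%s" % (base_url, urlencode(params, safe=",")), remaining
```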

The remaining list of GUIDs is saved so that future syncs will request those records in addition to the normal "get 100".

If a future sync happens to get 100 items and has new "to-be-processed" items in addition to an existing list, the two lists of GUIDs need to be merged. Two simple implementations are 1) prepend the incoming list to the existing list or 2) interleave the incoming and existing lists. Either favors recently-incoming records over the old, uninteresting items, but that isn't anything new, as we already process incoming records before checking the list.
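The interleaving option (2) might be implemented like this sketch; the function name and the duplicate-dropping detail are assumptions not spelled out in the proposal.

```python
from itertools import chain, zip_longest

def merge_pending(incoming, existing):
    """Interleave new to-be-processed GUIDs with the saved list, dropping
    duplicates, so recent records are favored over old, uninteresting ones."""
    merged, seen = [], set()
    for guid in chain.from_iterable(zip_longest(incoming, existing)):
        if guid is not None and guid not in seen:
            seen.add(guid)
            merged.append(guid)
    return merged
```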

New Opportunities

Dynamically choose how many records to download based on the machine

Only sync a certain number of interesting records for mobile devices

Don't even bother with uninteresting records

Other Proposals

Another idea tossed around was to use the paging mechanism provided by the server, so requests would look like ?limit=100 and then ?limit=100&offset=100. However, this approach would need to correctly handle cases where items were added or removed from previous "pages".

A number of complicated heuristics were discussed to detect mismatched page positions and then refetch previous pages, but they couldn't guarantee that no records would fall into holes. So an additional pass would be required after "finishing" the initial sync to make sure no items were missed.