Since it seems that the “arXiv on your hard drive” is dead, I’ve been thinking a bit about whether there is a better way to achieve the goal of distributing archives of the arXiv.

One thing I liked about the “arXiv on your hard drive” was that it used BitTorrent. This could alleviate some of the bandwidth pain associated with distributing the arXiv widely. But of course, one of the problems with using Torrents to distribute the arXiv is that, well, the arXiv changes daily! One solution to this is to update the torrent periodically, but in these go-go times this seems wrong. It seems to me that what we need is a BitTorrent-like protocol for collections that periodically get updated. A seeder could then update its collection and propagate only these new results to other hosts. Does anyone know if such a technology exists? A quick scan didn’t locate anything.

Of course then one would have to convince the arXiv folks to go along with this, but it would seem to me that the bandwidth costs for them could be made really fairly minimal.

Simple solution: represent the database as a distributed version control repository! Then you just pull the patches for what’s new since you last peeked, or perhaps just what’s new in some subset of categories (i.e., perhaps there is a repo for each category, with each batch of new papers / revisions being a patch, and an everything repo which pulls from each of those), and so forth!

Now the only question then is which of git, mercurial, bazaar and darcs is most appropriate? Or maybe which one is the right starting design point?

By his noble request, I humbly submit before you my own unworthy analysis of the use of distributed version control systems to solve this problem, so that you may waste your valuable time reading it.

In my likely to be erroneous opinion, the decision of whether to use Subversion versus BitTorrent versus a distributed version control system versus rsync for distributing “diffs” of the current arXiv misses the greater problem, which is that there are actually *two* use cases:

(1) Updating repositories so that they include the new papers released on the arXiv

(2) Downloading the entire arXiv to obtain the initial copy of the repository

The design of any such distribution system depends on to what extent use case (2) will be supported, and how. To see why, consider that diffs are several orders of magnitude smaller than the whole thing, and so the bandwidth of just a single instance of case (2) will likely outweigh the total bandwidth consumed by the whole community of people using case (1).
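To put rough numbers on this, here is a small Python sketch. The sizes are assumptions chosen only to match the orders of magnitude mentioned in this thread (a full archive of a couple of terabytes, updates of a couple of tens of megabytes); they are not measured figures.

```python
# Illustrative numbers only; both values are assumptions, not measurements.
FULL_ARCHIVE_BYTES = 2 * 10**12   # assumed: 2 TB for one full copy
UPDATE_BYTES = 20 * 10**6         # assumed: 20 MB for one incremental update


def updates_per_full_download(full_bytes: int, update_bytes: int) -> int:
    """How many incremental updates cost the same bandwidth as one full copy."""
    return full_bytes // update_bytes


equivalent_updates = updates_per_full_download(FULL_ARCHIVE_BYTES, UPDATE_BYTES)
print(equivalent_updates)  # prints 100000
```

Under these assumed sizes, one person fetching the whole archive costs as much bandwidth as 100,000 individual updates, which is the sense in which a single instance of case (2) can rival the whole community's use of case (1).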

It is not clear to me what the best way is to handle use case (2), but I can think of a few options:

A) Don’t let people download the whole thing unless they promise to become a mirror. Chances are not many people will want it anyway; if they really want it without setting up a mirror, then they could, for example, pay to get a copy of it sent via mail.

B) Use BitTorrent so that people downloading are forced to help share while they are downloading. The tricky thing here is that we’d realistically need to work out a system for distributing the whole thing as a series of big chunks. There is a trade-off: bigger chunks = less overhead = more time spent downloading (and thus assisting uploading) each chunk, but smaller chunks = easier to create as new papers are added to the repository. No matter how this is done, though, it requires not only a system to create the chunks but also double the storage, so that files can be stored both in the filesystem and inside a chunk. Furthermore, unless many people are downloading it simultaneously, this doesn’t really help with adding sharers to the network, although it does provide a convenient way to distribute the load across many computers and also to let people join the network.

C) Suck up the bandwidth hit and use a system of rotating mirrors or something to handle it.
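For option B, the chunking step could be sketched roughly as follows in Python. This is only one possible scheme, not something anyone in this thread has built: the function name `pack_into_chunks` and the greedy in-order grouping are my own assumptions.

```python
from typing import Iterable


def pack_into_chunks(files: Iterable[tuple[str, int]], chunk_bytes: int) -> list[list[str]]:
    """Greedily group (name, size) pairs, in order, into chunks of at most chunk_bytes.

    A file larger than chunk_bytes ends up in a chunk by itself.
    """
    chunks: list[list[str]] = []
    current: list[str] = []
    current_size = 0
    for name, size in files:
        # Close the current chunk if adding this file would overflow it.
        if current and current_size + size > chunk_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

If files are fed in submission order, appending new papers only ever changes the last, still-open chunk, so earlier chunks could be seeded unchanged. The chunk size then controls the trade-off described in option B.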

Furthermore, even if we just choose not to worry about this case and just focus on use case (1), most ways of giving people access to synchronize repositories become ways that somebody can download the whole thing by starting with an empty repository. For this reason, for example, it is unlikely that we will be able to talk the arXiv into setting up a public rsync/git/subversion/etc. server, since that would make it easy for people to suck up their bandwidth by downloading copies of the repository. (Incidentally, for this very reason, I think that the best we can hope for from the arXiv is that they will either (A) create an rsync server that only a few public mirrors can access, or (B) only publicly serve the most recent updates to the database.) Of course, one easy solution to this is to kill all connections that last longer than 10 minutes.

Anyway, the point that I am really getting at with all of this is that before deciding on *how* we distribute the arXiv to everyone, it is important to figure out exactly *what* it is that we want to distribute and to *whom* we are going to distribute it. Once we’ve figured out how much of the first few terabytes of data it is that we want to ship around, the last few tens of megabytes of “diffs” is the easy part.

Having said all this, for the “diff” part my personal recommendation would be rsync, since it is designed precisely to be as fast as possible in scanning for differences between repositories and transferring only what is needed; furthermore, it easily lets people pick and choose the subdirectories that they want to fetch and/or update. A distributed version control system offers similar functionality but is optimized for the case when you want to haul around a whole history of all changes to the repository so that you can jump to any point, which adds space for what are effectively extra copies of the files as well as the overhead of more things to compare. If you were to go this route, git is the fastest and would probably scale the best for a repository of this massive size. However, darcs has the advantage that if you make a patch for each file, you could conceivably tag groups of patches, making it easy to download all of the files in a given subcategory. This is less convenient with the rsync method, since using a directory tree to categorize files means each file belongs to only one category. Darcs has been observed to have performance problems with large repositories, but such repositories also have “non-commuting” patches that are relatively expensive to deal with, and nearly all of the patches in our imagined repository would commute, so we might actually get good performance from it.
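To make the “transfer only what is needed, and only for the subdirectories you want” idea concrete, here is a toy Python sketch. It is not how rsync works internally; it just compares two hypothetical manifests that map each relative path to a content digest. The function name `plan_sync` and the manifest layout are assumptions for illustration.

```python
def plan_sync(
    local: dict[str, str], remote: dict[str, str], prefix: str = ""
) -> tuple[list[str], list[str]]:
    """Return (to_fetch, to_delete) for paths that start with prefix.

    local and remote map relative path -> content digest.
    """
    wanted = {path: digest for path, digest in remote.items() if path.startswith(prefix)}
    # Fetch anything that is missing locally or whose digest differs.
    to_fetch = sorted(path for path, digest in wanted.items() if local.get(path) != digest)
    # Delete local files under the prefix that the remote no longer has.
    to_delete = sorted(
        path for path in local if path.startswith(prefix) and path not in wanted
    )
    return to_fetch, to_delete
```

Calling `plan_sync(local, remote, "quant-ph/")` would plan an update of a single category directory and leave everything else alone.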

As a side note, to get a sense of what has been done before with this kind of problem, consider the following: The Gentoo Linux distribution has something called the “portage” tree, which is a big tree containing lots of little scripts for automatically downloading, configuring, and installing supported programs from the internet. Users first grab a tarball with a snapshot of the tree from one of the mirrors, and then periodically “rsync” to update the tree to the current version, as scripts are updated very frequently. Although there are only a few mirrors, there are a bunch of people who have volunteered their computers as rsync servers, and when people update their tree they connect to the host “rsync.gentoo.org”, which randomly picks one of them. Of course, the difference for them is that their full tree zipped is only 35 megabytes, but the arXiv is several terabytes, so it isn’t clear to what extent this approach could be made to work similarly for us.

Wow, the Preview feature of this comment box is horribly broken! It totally failed to make me realize that I’d slipped and typed “Unworthy associates” rather than “Honorable associates” as I had meant to type (i.e., I was absent-mindedly thinking ahead to the next sentence when I was about to type “unworthy”), which completely transforms the tone from being quirkily humble to randomly offensive. I apologize for my computer’s failure!

An alternative to DVCS might be something like CouchDB, which features replication/synchronisation of the database. It also has a powerful querying language, through its views. And the best thing is, those views are also synchronised with the database (they’re just documents in the DB), so people could also share their queries.

Again it would have the ‘initial mirror’ problem, but once that is set up, it is intrinsically distributed and any mirror can be used for synchronisation.

How many people really want to host the entire arXiv on their local machine, other than for bragging rights? Sure it would mean that you never have to wait to download a particular paper, provided that you update each day, but just how quick would local search be without using some distributed filesystem over a dedicated cluster?

Rather, I think a more useful setup would be for readers to have easy access to the papers they are interested in, in addition to hosting a small subset of arXiv.org. Given enough people involved, the whole of the arXiv can collectively be shared, with hopefully moderate costs involved during both setup and for any individual’s overhead.

For example, the first time someone wants to download some paper, a quick query is made to arXiv.org to determine how many versions are present. These records are then downloaded, either from arXiv.org or from another listed server if it is already present elsewhere. Then, if the user is willing to be a host for others, all of the papers within that block of (say) a hundred are also downloaded. This is simpler with the new numbering system, but with the old format, it can download up to ninety-nine neighboring papers within the particular category. In order to stay updated, this new server can either view each day’s listings, looking particularly at replacements, or it can simply check whether the number of versions listed on arXiv.org is consistent with its local value.
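As a rough illustration of the “block of a hundred” idea for the new numbering system, here is a Python sketch. It assumes identifiers of the form `YYMM.NNNN` and does not handle the old per-category format. The function names are hypothetical, and the sketch does not check that every identifier in a block actually exists.

```python
def block_of_hundred(arxiv_id: str) -> list[str]:
    """List the 100 new-style identifiers that share a block with arxiv_id."""
    yymm, number = arxiv_id.split(".")
    width = len(number)                   # keep the original zero padding
    start = (int(number) // 100) * 100    # round down to the start of the block
    return [f"{yymm}.{n:0{width}d}" for n in range(start, start + 100)]


def needs_update(local_versions: int, listed_versions: int) -> bool:
    """True if arXiv.org lists more versions than the local copy holds."""
    return local_versions < listed_versions
```

For instance, a host that fetched `0704.0123` would also take responsibility for `0704.0100` through `0704.0199`, and would refetch a paper whenever `needs_update` reports that its local version count has fallen behind.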

This same approach can be used for daily updates as well. Once a few people have downloaded the day’s new papers, they can then host most of the new inquiries, and it can be spread out from there. This does suggest that not everyone should try to update immediately; otherwise, everyone would end up downloading directly from arXiv.org. For replacements, the number of versions listed on arXiv.org should always be checked, and if earlier versions are not present locally, then again these would be updated either from arXiv.org or from a listed server (whoever is hosting the block of hundred papers that includes the queried one).

It is sad to bring up, but there is an issue of data integrity as well once one relies on other than the central authority for what is “official.” It would probably be necessary for arXiv.org to provide easy access to hash values of each HTML page and PDF file, so that any user can verify that what they have downloaded from another source is authentic. This would be a simple way to identify any non-malicious corruption of data (whether on a local machine or during transport), and it would at least raise the bar in protecting against malicious rewriting of someone’s paper/abstract. This would involve a connection to arXiv.org for every download from elsewhere, but it should be a very short message (probably less than 1 KB).
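A minimal sketch of that check in Python, assuming arXiv.org published a SHA-256 hex digest for each file. The choice of SHA-256 and the function names are my assumptions; nothing here reflects an existing arXiv service.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def is_authentic(data: bytes, published_digest: str) -> bool:
    """Compare a downloaded file against the digest published by the central site."""
    return sha256_hex(data) == published_digest.strip().lower()
```

A SHA-256 hex digest is 64 characters long, so the verification message from arXiv.org would indeed fit comfortably within the 1 KB estimate.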

One final note: should emails for submitters be protected from being available in a distributed source, given that arXiv.org requires logging in to a user account to view them currently?

“How many people really want to host the entire arXiv on their local machine, other than for bragging rights? ”

You raise a good question. However, I think there is a ton of value in allowing this. First of all, it should be noted that because of the arXiv’s robot policy, nothing like this is currently possible. In other words, over 15 years of research, while accessible on a nibble basis, is not really accessible.

Suppose, for instance, you wanted to work on applications which use the corpus of the arXiv. Today you couldn’t do it without jumping through a lot of hoops. Suppose, for instance, that the publication model _changes_, i.e. what if a format arises which is superior for scientific documents, including things like code/data/etc. in more raw form? You might want access to data like this across multiple experiments, for example. Plus, while it is true that we are pretty much connected all the time, there are still times when we are away from the beast known as a live internet connection. Okay, yeah, I’ve flown too much this year 🙂 Another reason is that the one-paper-at-a-time system makes it very hard to develop systematic software for, say, bibliographies, etc.

Does everyone want the arXiv on their hard drive? Probably not. But I think there is a growing group who could do interesting things with it, and this justifies at least consideration of the idea. In principle, yes, things can be done by download, but currently the download policy at the arXiv is limited by bandwidth constraints. Does it have to be this way?

I do agree that data integrity is important, but don’t understand why a central repository cannot continue to be in charge of this. If the “unofficial” repositories are corrupted, this shouldn’t change the status of the official repository (and I’d guess that those running any unofficial repository would have just as much stake in keeping the data sound.) I would also point out that right now we have no real guarantee of the data integrity of the arXiv. Do you know that they haven’t corrupted one of your papers? (That would be cool. Well it would suck but it would be interesting!)