A Censorship-Resistant Web

Imagine someone put a document up at http://pentagonpapers.com/volumes/1.html that a) some people want to read and b) some people want to keep you from reading.

Step zero: How it works now

On the current Web, the way you request such a document is like this:

You ask one of your pre-programmed root servers who is in charge of .com

They respond with VeriSign, so you ask VeriSign who is in charge of pentagonpapers.com

They respond with Acme ISP, so you ask Acme ISP where to find pentagonpapers.com

It responds with an IP address, so you request the page from that IP

The censors can ask VeriSign to give them control of pentagonpapers.com, they can try to shut down Acme ISP, they can try to prevent you from getting hosting, and they can try to shut down your IP. All of these have been used recently, with some success. You need a backup plan.
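To make the weak points concrete, here is a toy model of that lookup chain. The table contents and the IP address (taken from the reserved documentation range) are made up for illustration, and `resolve` is a hypothetical helper rather than a real DNS client. The point is only that emptying any single hop breaks the whole lookup.

```python
# Toy model of the lookup chain: root -> .com registry -> ISP -> IP address.
# All table contents are illustrative assumptions, not real DNS data.
DELEGATIONS = {
    "root": {".com": "VeriSign"},
    "VeriSign": {"pentagonpapers.com": "Acme ISP"},
    "Acme ISP": {"pentagonpapers.com": "192.0.2.10"},
}

def resolve(domain, delegations):
    """Follow each delegation in turn; return the IP, or None if a hop is missing."""
    tld = "." + domain.rsplit(".", 1)[-1]
    registry = delegations.get("root", {}).get(tld)
    name_server = delegations.get(registry, {}).get(domain)
    return delegations.get(name_server, {}).get(domain)

# A censor who gets the registry to drop the domain breaks resolution.
censored = {**DELEGATIONS, "VeriSign": {}}
print(resolve("pentagonpapers.com", DELEGATIONS))  # 192.0.2.10
print(resolve("pentagonpapers.com", censored))     # None
```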

Let’s imagine we want this URL to resolve in an uncensorable way. How would we do it?

Step one: Domain name ownership

First we would have a certificate authority (CA) which would sign statements of the form: “As of [DATE], [DOMAIN NAME] is owned by the holder of [PUBLIC KEY].” (Let’s call this a certificate.) Conveniently, there’s already a whole industry of trusted businesses that make these statements — they’re called SSL certificates.

The problem is that CAs are presumably just as subject to attack as the registrars (in fact, in some cases they are the registrars!). One possibility is to set up a certificate authority that will not sign such statements for people attempting to engage in censorship. It seems probable that such a policy would be protected by the First Amendment in the US. However, “people engaging in censorship” is a somewhat subjective notion. Also, it’s always possible a court could order the certificate authority to turn over the private signing key (or the key could be obtained in some other way).

Another possibility is some kind of “rollback UI”. If you know vaguely when the censorship attempts started, you can only trust certifications made before that date. This is a somewhat difficult feature to implement in a way that makes sense to users, though. The best case scenario is one in which the user can clearly distinguish between a censored and uncensored page. In that case, if the page appears censored they can hit a “go back a month” button and the system will only trust certifications made more than a month prior to the certification it’s currently using. The user can hit this button repeatedly until they get an uncensored version of the page.
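The selection logic behind a “go back a month” button could look roughly like this. It is only a sketch: the `(signed_on, public_key)` pair layout, the name `trusted_certificate`, and the choice to treat a month as 30 days are all my assumptions, and signature checking is left out.

```python
from datetime import date, timedelta

def trusted_certificate(certificates, months_back=0):
    """Pick the certificate to trust after `months_back` presses of the button.

    `certificates` is a list of (signed_on, public_key) pairs for one domain.
    Each press only trusts certificates made more than a month (approximated
    here as 30 days) before the one currently in use.
    """
    ordered = sorted(certificates, key=lambda cert: cert[0], reverse=True)
    if not ordered:
        return None
    current = ordered[0]
    for _ in range(months_back):
        cutoff = current[0] - timedelta(days=30)
        older = [cert for cert in ordered if cert[0] < cutoff]
        if not older:
            return None  # nothing old enough is left to fall back to
        current = older[0]
    return current

certs = [
    (date(2010, 1, 5), "owner-key"),
    (date(2010, 12, 10), "censor-key"),  # issued after the takeover
]
print(trusted_certificate(certs)[1])                 # censor-key
print(trusted_certificate(certs, months_back=1)[1])  # owner-key
```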

Step two: Web page authentication

Next the owner of the website will need to sign statements of the form “The content of [URL] had the hash [HASH] on [DATE].” (Let’s call this an authenticator.) Now given a page, a corresponding valid authenticator, and a corresponding valid certificate (call this trio an authentic page), a browser can safely display the page even if it can’t reach the actual web server. The digital signatures work together to prove that the page is what the website owner wanted to publish. If a browser gets back multiple authentic pages, it can display the latest one (modulo the effects of the “go back a month” button).
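Here is a minimal sketch of how a browser might pick which page to show. The `Authenticator` class, the use of SHA-256, and the `signature_ok` flag (a stand-in for actually verifying the owner's signature against the certificate's public key) are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Authenticator:
    url: str
    content_hash: str   # hex SHA-256 of the page body (hash choice is an assumption)
    signed_on: date
    signature_ok: bool  # stand-in for real public-key signature verification

def latest_authentic_page(url, candidates, authenticators):
    """Return the newest candidate body that a valid authenticator vouches for."""
    by_hash = {hashlib.sha256(body).hexdigest(): body for body in candidates}
    valid = [
        a for a in authenticators
        if a.url == url and a.signature_ok and a.content_hash in by_hash
    ]
    if not valid:
        return None
    newest = max(valid, key=lambda a: a.signed_on)
    return by_hash[newest.content_hash]
```

A body with no matching authenticator is never displayed, and an authenticator whose signature fails is ignored no matter how recent its date is.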

Step three: Getting authentic pages

Set up a series of domain-to-certificate servers. These servers take a domain name (e.g. pentagonpapers.com) and return any certificates for it. Certificates can be obtained by crawling the Web, submitted by website owners, or submitted by the CAs themselves.

Set up a series of URL-to-hash servers. These servers take a URL and return back any valid authenticators for that URL. Authenticators are very small, so each URL-to-hash server can probably store all of them. If spam becomes a problem, a little bit of hashcash could be required for storage. Website owners submit their authenticators to the URL-to-hash servers.
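For the spam problem, a hashcash-style stamp might work along these lines. This is a simplified variant, not the real Hashcash stamp format (which uses SHA-1 and a structured header); the `resource:counter` layout, the function names, and the 12-bit default difficulty are assumptions.

```python
import hashlib
from itertools import count

def leading_zero_bits(digest):
    """Count the zero bits at the front of a digest."""
    as_int = int.from_bytes(digest, "big")
    return len(digest) * 8 - as_int.bit_length()

def mint(resource, bits=12):
    """Search for a counter whose stamp hashes to at least `bits` leading zero bits."""
    for counter in count():
        stamp = f"{resource}:{counter}"
        if leading_zero_bits(hashlib.sha256(stamp.encode()).digest()) >= bits:
            return stamp

def check(stamp, resource, bits=12):
    """Verification costs one hash, however long minting took."""
    if not stamp.startswith(resource + ":"):
        return False
    return leading_zero_bits(hashlib.sha256(stamp.encode()).digest()) >= bits
```

A submitter would attach `mint(url)` to each authenticator, and the server would call `check` before storing it. Raising `bits` by one roughly doubles the submitter's expected work.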

Set up a series of hash-to-URL servers. These servers take a hash and return a series of URLs which can be dereferenced in the normal way to obtain a file with that hash. People can submit hash-to-URL mappings to these servers and they can attempt to automatically verify them by downloading the file and seeing if the hash matches.[1][2] Again, these mappings are very small so each server can probably store all of them.[3]
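The verification step, including a cap on how much data the server is willing to read from a submitted URL, could be sketched like this. The function name, the status strings, and the 50 MB default are assumptions. SHA-1 is used only because the example locator URL later in this essay uses it; a deployment today would likely want a stronger hash.

```python
import hashlib

MAX_BYTES = 50 * 1024 * 1024  # illustrative limit, not a recommendation

def verify_mapping(expected_hash, chunks, max_bytes=MAX_BYTES):
    """Classify a submitted URL as 'verified', 'mismatch' or 'unverified-too-large'.

    `chunks` is any iterable of bytes, e.g. an HTTP response body read piece
    by piece, so a response that never ends is abandoned at the limit.
    """
    digest = hashlib.sha1()
    received = 0
    for chunk in chunks:
        received += len(chunk)
        if received > max_bytes:
            return "unverified-too-large"
        digest.update(chunk)
    return "verified" if digest.hexdigest() == expected_hash else "mismatch"
```

Taking an iterable rather than a URL keeps the sketch free of network code; a real server would feed it the chunks of a streamed download.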

Then there is a series of servers that host controversial files. Perhaps they saved a copy before the site was censored, perhaps they received it thru some out-of-band channel.[4] However they got it, they put it up on their website and then submit the URL to the hash-to-URL servers. Meanwhile, the site publisher submits an authenticator to the URL-to-hash servers.

Now, if a browser cannot obtain the pentagonpapers page through normal means it can:

Ask each domain-to-certificate server it knows for certificates for pentagonpapers.com

Ask each URL-to-hash server it knows for authenticators for http://pentagonpapers.com/volumes/1.html

Ask each hash-to-URL server it knows for alternative URLs for the hashes in those authenticators

Download the file from the alternative URLs[5] and check it against the certificate and authenticator

This can be implemented through a browser plugin that you click when a page appears to be unavailable. If it takes off, maybe it can be built into browsers. (While I’ve been assuming the worst-case scenario of censorship here, the system would be equally useful for sites that are just down because their servers couldn’t handle the load or because of some other innocent failure.)
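Putting the three kinds of servers together, the fallback a plugin performs might look something like the sketch below. Plain dicts stand in for the remote servers, `download` is a hypothetical callable that maps a mirror URL to bytes (or None), authenticators are reduced to `(date, hash)` pairs with ISO date strings, and signature checks are omitted to keep it short. None of these shapes come from an existing protocol.

```python
import hashlib
from urllib.parse import urlsplit

def recover_page(url, cert_servers, auth_servers, mirror_servers, download):
    """Try to recover `url` via certificate, URL-to-hash and hash-to-URL lookups."""
    domain = urlsplit(url).hostname
    certificates = [c for server in cert_servers for c in server.get(domain, [])]
    if not certificates:
        return None  # nobody vouches for who owns the domain
    authenticators = {a for server in auth_servers for a in server.get(url, [])}
    # Newest authenticator first, so the latest published version wins.
    for _signed_on, content_hash in sorted(authenticators, reverse=True):
        for server in mirror_servers:
            for mirror_url in server.get(content_hash, []):
                body = download(mirror_url)
                if body is not None and hashlib.sha256(body).hexdigest() == content_hash:
                    return body
    return None

page = b"<h1>Volume 1</h1>"
digest = hashlib.sha256(page).hexdigest()
url = "http://pentagonpapers.com/volumes/1.html"
mirrors = {
    "http://mirror-a.example/1.html": b"tampered",
    "http://mirror-b.example/1.html": page,
}
result = recover_page(
    url,
    cert_servers=[{"pentagonpapers.com": ["certificate-for-owner-key"]}],
    auth_servers=[{url: [("2010-12-01", digest)]}],
    mirror_servers=[{digest: list(mirrors)}],
    download=mirrors.get,
)
print(result == page)  # True
```

The tampered mirror is skipped because its content does not hash to the value in the authenticator, so a dishonest mirror can waste bandwidth but not substitute content.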

This system should work unless our adversary can censor every well-known CA, every well-known URL-to-hash server, every well-known hash-to-URL server, or every alternative URL.

Step four: Beyond the Web

We can help ensure this by operating at least one of each as a Tor hidden service. Because the operator of the service is anonymous, they are immune to legal threats.[6] If the user doesn’t have Tor, they can access them through tor2web.org.

Similarly, if you know your document is going to get censored, you can skip steps 1 and 2. Instead of distributing a pentagonpapers.com URL which is going to go down, you can just distribute the hash. For users whose browsers don’t support this system, you can embed the hash in a URL like:

https://hash2url.org/sha1/284219ea93827cdd26f5a697112a029b515dc9a4

where hash2url.org is a hash-to-URL server that redirects you to a valid URL.
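Building such a locator from a file's contents takes only a couple of lines. The `/sha1/<hex digest>` path layout is copied from the example URL above; whether any particular server accepts exactly that layout is an assumption, and `hash_url` is a hypothetical helper name.

```python
import hashlib

def hash_url(content, server="https://hash2url.org"):
    """Build a /sha1/<hex digest> locator like the example above."""
    return f"{server}/sha1/{hashlib.sha1(content).hexdigest()}"

# The SHA-1 digest of empty input is a well-known constant, handy as a check.
print(hash_url(b""))
# https://hash2url.org/sha1/da39a3ee5e6b4b0d3255bfef95601890afd80709
```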

And, of course, if you somehow have access to a working P2P system, you can just use it to obtain authentic pages.

Conclusions

What’s nice about this system is that it gets you censorship resistance without introducing anything wildly new. There are already certificate authorities. There are already hash-to-URL servers. There are already mirrors. There’s already Tor. (There’s already tor2web.) The only really new thing specific to censorship resistance is URL-to-hash servers of the form I described, but they’re very simple and hopefully uncontroversial.

There is some work to be done stitching all of these together and improving the UI, but unlike with some other censorship-resistance systems, there’s nothing you can point to as having no good purpose except for helping bad guys. It’s all pretty basic and generally useful stuff, just put together in a new way.

[1] Any server will have finite bandwidth, so an attacker could try to fool the hash-to-URL server by submitting a URL which when dereferenced never stops sending the data. The hash-to-URL servers should stop after a certain limit and mark the URL as unverified due to max file size. If the server ever obtains a file whose size is under the limit with that hash, it can toss all such URLs.

[2] URLs can go out of date so perhaps upon receiving sufficient complaints about a URL being “bad”, the server should attempt to reverify. Again, hashcash can be used throughout to avoid spam.

[3] A possible protocol for the above two servers is provided in RFC 2169.

[4] I have ideas on how to automate this, naturally, but this essay is already far too long.

[5] Optional bonus: Use HTTP Range headers to download 1/n of the file from each of the n URLs. There are some circumstances where this could speed things up. Or maybe it’s just annoying.

[6] This moves the censorship weak link to the distribution of introduction points to hidden services.[7] But instead of being published by a DHT, introduction points can be distributed through a flood protocol.[8] Or maybe the DHT can be modified so that there’s no obvious censorship point?

[7] The introduction points themselves can’t be censored because they don’t know who they’re talking to. (I think they do in the current implementation of Tor, but this doesn’t seem necessary. The hidden service can generate a new keypair for each introduction point and send the public key to the introduction point and to Alice.)

[8] Is this too chatty? Probably. But remember, it’s a last-case resort in some kind of insane police-state world where every country prevents people from running servers that give out the IP addresses of other servers that let you talk to a third server which will give you illegal content.

Comments

I also got thinking about what you do when there are lots of potential takedown points. I was thinking one handy piece could be a client-side piece — a few static files that anybody can put on a server constitute a program that 1) tries a bunch of directory servers to find where a file with a particular hash is stored, 2) tries to download and check the hash of the file at each location, 3) displays the “winner,” the first document that hashes to the right value. (That takes cross-domain communication and crypto in JavaScript, but both are possible these days.) The handy thing is that the client piece can be hosted anywhere trivially, without hosting the actual censored content. The client could also come with a list of some hashes for interesting documents and some places to look for them.

Or a simpler version of that would be just one page that looks through a hardcoded list of URLs for a page whose contents authenticate, i.e., one that isn’t down yet. Just saves the user the trouble of testing a list of mirrors by hand, but that’s a real win when users want to find stuff easily.

Your system lets the publisher update the page, including when the update content isn’t known beforehand — e.g., when it’s breaking news instead of the next volume of the Pentagon Papers. That’s harder to solve with a document-hash-based system, but possible; you need to authenticate that the update is also by the original author. If the original document contains the author’s public key, or a hash of it, or a hash used for a one-time Lamport signature, you can use that.

(Not trying to argue about the merits of the two systems. One clear advantage of a system without a heavy client side is that it doesn’t depend on the vagaries of the client platform (i.e., it runs on IE6). Just spelling out what I’d been thinking about when I read this.)
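For reference, the one-time Lamport signature mentioned in the comment above can be built from nothing but a hash function. The sketch below uses SHA-256 and hypothetical function names; it is a textbook version for illustration, and a key pair must never sign more than one message.

```python
import hashlib
import secrets

def _h(data):
    return hashlib.sha256(data).digest()

def keygen():
    """Private key: 256 pairs of random secrets. Public key: their hashes."""
    private = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(256)]
    public = [(_h(a), _h(b)) for a, b in private]
    return private, public

def _bits(message):
    """The 256 bits of the message's SHA-256 digest."""
    digest = int.from_bytes(_h(message), "big")
    return [(digest >> i) & 1 for i in range(256)]

def sign(private, message):
    """Reveal one secret per bit of the message hash. Use each key only once."""
    return [pair[bit] for pair, bit in zip(private, _bits(message))]

def verify(public, message, signature):
    """Check that every revealed secret hashes to the matching public value."""
    return len(signature) == 256 and all(
        _h(revealed) == pair[bit]
        for revealed, pair, bit in zip(signature, public, _bits(message))
    )
```

The original document would carry the public key (or a hash of it), and the update would carry the signature. Because each signature reveals half of the private key, each update would also need to include the public key for the next one.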

You might want to look at content-based addressing. The idea is that whenever we want to put information on the web, we take a 128-bit (or larger) hash of the document. That hash is then the permanent reference. If anybody wants to retrieve the document, they submit the hash to a retrieval service. Authenticity is obtained by hashing and availability is guaranteed by having lots of retrieval services. Name -> Hash can be implemented in the system itself (by making a root-document style of entry, essentially a directory structure).

Van Jacobson is doing some work in this space: pages.cs.wisc.edu/~akella/CS838/F09/838-Papers/ccn.pdf

You can easily overcome the CA point of failure by using a history-based trust model. (CAs are part of a delegation-of-authority trust model.)

How do you do that? By basing your censorship circumvention system on a built-in social network. This way you put trust in the right entities: identities (people). Not content and not arbitrarily chosen authorities.

This is something search engines should do automatically. Google could easily compute the secure hash of every page it crawls and provide automatic links to mirrors (defined as pages with the same secure hash) of pages that are no longer available (or just under high load). It could even provide an indication of how complete a mirror is by the percentage of files linked to by the original domain that are also present (and match the secure hash) at the mirror.

A very similar service exists now in the form of URL shortening services. It would not require much work to implement one-to-many relationships between the shortened versions of URLs and the mirrors of the content. Automatic detection of mirrors could be accomplished by keeping hashes of the original content and comparing any new URL submitted for shortening to the hash database and merging detected mirrors into one shortened URL. A recursive hash tree would be necessary to prevent spoofed mirrors, and a reverse lookup added for users who know the original URL but can’t access the site because it is down or censored.

The only thing missing from either of these descriptions is a notion of authoritative publishing, e.g. detecting when an original URL points to an updated, but authoritative, site and the mirrors aren’t updated yet. In this case, it would clearly have been in everyone’s best interest if the web had been designed with signed content from the beginning. If a suitable standard could be found for signing the content of web pages (even something as simple as dropping index.html.gpg, image1.jpg.gpg or a collective MD5SUMS.gpg into the webroot), mirror detection services could automatically recognize newer authoritative content by a newer signature date. Mirrors of signed content would be identified by a common set of public keys used to sign their content, and when an update occurred it would be recognized that the mirrors no longer matching the content are still mirrors, but simply out of sync. This would prevent the ability of attackers to spam old mirror content or otherwise attempt to add mirrors with disinformation.

I think your scheme could be extended to support OpenPGP signatures in addition to (and maybe in preference over) hashes plus SSL certificates. The publisher would publish a signature of the contents of a URL instead of just a hash of a URL. A hash is still included in the signature for the purpose of uniquely identifying content, but it is tied to a publisher. This would not require additional trusted SSL certificates for mirrors, just the presence of the signed content on any mirror sites.

There are still attacks possible where lots of rogue mirrors are published with a different key in an attempt to steal the authenticity of all mirror sites away from the original publisher. Basically the only way to circumvent such a situation is with a web of trust like PGP implements. Take a situation where wikileaks is completely censored and forced offline. At this point any number of mirrors may exist but it is now impossible to tell which may be being updated by the “real” wikileaks crew and which may be subverted. If a majority of mirrors (which includes many fakes created by attackers) start to publish new information, there is no way to tell with SSL certs alone whether the updates should be trusted. This is kind of a far fetched attack, but it makes sense to avoid any potential weaknesses during the design phase. Provisions for recognizing revoked PGP keys should also be in place for the case where publishers know they will be arrested/interrogated/subverted and further updates with their key should not be trusted.

You’ve got the right idea, but this bothers me:
“Conveniently, there’s already a whole industry of trusted businesses that make these statements — they’re called SSL certificates.”
Can we really trust any businesses in this scenario?

You should look at Freenet; it’s doing something very similar already. You discover nearby nodes and when you want to retrieve a file (by hash), you ask them if they have it. If they don’t, they ask someone else, and relay it back to you. This means there’s no way to tell if someone is downloading the file for themselves or just passing it along.

Each node may keep copies of some of the content as well, deleting old files that haven’t been requested in some time - so there are many copies of every file floating around, disappearing only when nobody wants them anymore. Since the files are encrypted (part of the URL serves as the key), the nodes can’t tell what is in the files they decide to keep, so they can’t try to censor or alter anything and can’t be prosecuted for keeping it.

Separate from censorship-resistance, this system would also allow ad-hoc p2p caching of content — to handle traffic surges or provide a level of anonymized viewing.

The domain names seem to be like handy pet names for certificates. The proposed ‘tdb:’ URI scheme could prove useful for referring to expired mappings of domain->cert or URI->content. (‘tdb’ seems to have subsumed ‘duri’ in Masinter’s latest draft: http://larry.masinter.net/duri.html )

A URI scheme giving every signing-key its own path-like namespace could also be useful, as with the hypothesized ‘kau’ and other schemes we discussed in this 2002 thread:

Freenet, Tor, and DHTs seem pretty hard to censor but involve a lot of work for end users. The unsolved problem seems to be finding a way to locate censored content that just requires the end user to hit a locator URL (maybe at one of many domains), gives the locator services some separation from the actual content hosts (making it “safer” to host a locator), and is pretty resilient when content hosts go down.