Sunday, October 25, 2009

Distributed HTTP

A couple years back, a friend of mine got into an area of research which was rather novel and interesting at the time. He created a website he hosted from his own computer, where one can read about the research, and download various data samples he made.

Fairly soon, it was apparent that he couldn't host any large data samples. So he found the best compression software available, and setup a BitTorrent tracker on his computer. When someone downloaded some data samples, they would be sharing the bandwidth load, allowing more interested parties to download at once without crushing my friend's connection. This was back at a time when BitTorrent was unheard of, but those interested in getting the data would make the effort to do so.

As time went on, his site popularity grew. He installed some forum software on his PC, so a community could begin discussing his findings. He also gave other people login credentials to his machine, so they can edit pages, and upload their own data.

The site evolved into something close to a wiki, where each project got its own set of pages describing it, and what was currently known on the topic. Each project got some images to visually provide an idea of what each data set covered before one downloaded the torrent file. Some experiments also started to include videos.

As the hits to his site kept increasing, my friend could no longer host his site on his own machine, and had to move his site to a commercial server which required payment on a monthly or yearly basis. While BitTorrent could cover the large data sets, it in no way provided a solution to hosting the various HTML pages and PNG images.

The site constantly gained popularity, and my friend was forced to keep upgrading to an increasingly more powerful server, where the hosting costs increased just as rapidly. Requests for donations, and ads on the server could only help offset costs to an extent.

I imagine other people and small communities have at times run into similar problems. I think its time for a solution to be proposed.

Every major browser today caches files for each site it visits, so it doesn't have to rerequest the same images, scripts, and styles on each page, and conditionally requests pages, only if they haven't been updated since what was in the cache. I think this already solves one third of the problem.

A new URI scheme can be created, perhaps dhttp:// that would act just like normal HTTP with a couple of exceptions. The browser would have some configurable options for Distributed HTTP, such as up to how many MB per site will it cache, how many MB overall, how many simultaneous uploads will it be willing to provide per site, as well as overall, which port it will run on, and a duration as to how long it will do so. When the browser connects via dhttp://, it'll include some extra headers providing the user's desired settings on this matter. The HTTP server will be modified to keep track of which IP addresses connected to it, and downloaded which files recently.

When a request for a file comes into the DHTTP server, it can respond with a list of perhaps five IP addresses to choose from (if available), chosen based on an algorithm designed to round robin the available browsers connecting to the site, and the preferences chosen therein. The browser can then request via a normal HTTP request the same file from one of those IP addresses it received. The browser would need a miniature HTTP server built in which would understand that requests coming to it that seem to be for a DHTTP server should be replied to from its cache. It would also know not to share files that are in the cache which did not originate from a DHTTP server.

If requests to each of those IP addresses have timed out, or responded with 404, then the browser can rerequest that file from the DHTTP server set with a timeout or unavailable header for each of those IP addresses, in which case the DHTTP server will respond with the requested file directly.

The HTTP server should also know to keep track of when files are updated, so it knows not to refer a visitor to an IP address which contains an old file. This forwarding concept should also be disabled in cases of private data (login information), or dynamic pages. However for static (or dynamic which is only generated periodically) public data, all requests should be satisfiable by this described method.

Thought would have to be put into how to handle "leach" browsers which never share from their cache, or just always request from the DHTTP server with timeout or unavailable headers sent.

I think something implemented along these lines can help those smaller communities that host sites on their own machines, and would like to remain self hosting, or would like to alleviate hosting costs on commercial servers.

11 comments:

Browsers which only request data but don't support sharing data could be handled as HTTP is handled now, and don't particularly have to be worried about.

As for browsers which would attempt to request with headers specifying it could not get the file elsewhere, the server could employ some techniques to discourage such an occurrence.

The server could require all requests which have timeout headers specified to have made the same request to the server without timeout headers set "10 seconds * number_of_IPs_returned" prior. Otherwise they can be ignored.

Browsers would then have it in their best interest to attempt to get the file elsewhere first.

I'd be concerned about the security of the files you're getting from an IP, as they could replace a segment with new data and flub the metadata to look the same, so it seems like you'd have to have the source send back a series of hashes in the header, and then hash-check all files not recieved from the source.

Just like AJAX was possible for years before anyone realized it, this is also already possible with existing technologies. Flash 10 already supports a peer-to-peer network -RTMFP (Real-Time Media Flow Protocol)- many sites already use it will bandwidth heavy media files. And Flash can communicate with JavaScript. So you can have a flash applet that downloads from the peer-to-peer network and then using javascript writes the output to the page. Are there any Flash programmers interested in trying this?

Mr: You seem to have missed the point entirely. He's not saying "use bittorrent to get your html" he's saying "use http from a peer to get your html". That system doesn't have hashes in place, because that system doesn't exist yet. I was simply pointing out that such a system would require file verification.