Caching Makes Web Faster, But Might Hurt Business

Caching proxy servers are at the center of growing tensions between Web surfers and the advertisers who are paying for much of the content. While caching makes good technical sense for both Web sites and Web users, it could be bad for business, and may actually violate copyright law.

A caching proxy server sits between a Web surfer and the rest of the Internet. Instead of trying to download HTML files and images from random Web sites throughout the world, you simply send your HTTP GET requests to the caching proxy server. The proxy server checks to see if it has downloaded the document recently. If it has, you get a copy from the proxy's cache on its hard drive. If the file isn't there, the proxy issues its own HTTP GET request to the Internet at large, sends you a copy of the file, and keeps a copy for itself in case somebody else asks for it.
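The lookup logic just described can be sketched in a few lines of Python. This is a toy model, not how any real proxy is written: the class name CachingProxy, the fake_origin function, and the example.com URL are all made up for illustration, the cache lives in memory rather than on a hard drive, and nothing ever expires.

```python
from typing import Callable, Dict


class CachingProxy:
    """Toy in-memory model of a caching proxy (hypothetical, for illustration)."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin      # stands in for a real HTTP GET
        self._cache: Dict[str, bytes] = {}   # URL -> cached document body
        self.origin_requests = 0             # how often we went out to the Net

    def get(self, url: str) -> bytes:
        # Cache hit: answer from the local copy, no Internet traffic at all.
        if url in self._cache:
            return self._cache[url]
        # Cache miss: fetch the document ourselves, keep a copy, pass it on.
        self.origin_requests += 1
        body = self._fetch(url)
        self._cache[url] = body
        return body


def fake_origin(url: str) -> bytes:
    """Pretend Web site; an assumption so the sketch runs without a network."""
    return f"<html>content of {url}</html>".encode()


proxy = CachingProxy(fake_origin)
for _ in range(1000):                        # a thousand readers behind the proxy
    proxy.get("http://example.com/index.html")
print(proxy.origin_requests)                 # the origin site saw one request
```

A thousand readers ask for the page, but the origin site sees a single request. That one number is the whole story of this article.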

At first it might seem that a caching proxy server would slow down surfing, but it doesn't. That's because you normally have a fast connection to the proxy server itself - it's sitting there on your company's firewall, or it's on your ISP's internal network. Often, there's only a small delay going through the server to get new material off the Net. And if the file's already in the cache, you avoid sending the data over the Internet entirely. So Web users love caching servers: They make the Web run much faster.

Web sites benefit from caching servers, too. A large organization can set up a caching proxy and download a single copy of a popular HTML page, rather than retrieving an identical copy for every person sitting behind the corporate firewall. That means a site can have a million readers a day but receive only a thousand hits, which in turn means it doesn't need a whole farm of Web servers to serve up popular content.

But most advertiser-supported Web sites are terrified of caching proxy servers. That's because, from the far side of the cache, there's no way to tell a single reader from a hundred thousand without looking at the proxy's log files. It's hard to tell an advertiser that the single hit you got from America Online really represents 100,000 readers. It's easier to demand that AOL stop caching. And these sites have international copyright law on their side. Any copyrighted documents stored in the proxy server's cache are fundamentally unauthorized copies.

It turns out you don't need to call the lawyers to disable the cache. All you have to do is put the header "Expires: 0" in your Web server's HTTP response. This tells most caching proxy servers that what they have previously cached is already out of date. There's also a Pragma: no-cache header, which tells the proxy not to cache the information at all.
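For the curious, here is roughly what sending those two headers looks like, sketched with Python's standard http.server module. The handler name UncachedHandler, the serve helper, the page body, and port 8000 are my own assumptions for the example; a production Web server would set the same headers through its own configuration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# The two response headers named in the text above.
NO_CACHE_HEADERS = [
    ("Expires", "0"),        # anything a proxy has cached is already stale
    ("Pragma", "no-cache"),  # ask the proxy not to cache the response at all
]


class UncachedHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that marks every response as uncacheable."""

    def do_GET(self) -> None:
        body = b"<html><body>Always fetched from the origin.</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        for name, value in NO_CACHE_HEADERS:
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Run the demo server on localhost until interrupted (port is arbitrary)."""
    HTTPServer(("127.0.0.1", port), UncachedHandler).serve_forever()
```

Calling serve() and requesting http://127.0.0.1:8000/ with any browser or command-line client should show both headers in the response.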

But that's the wrong solution. "A lot of information providers don't understand that caching may be to their benefit," says Jim Gettys, a visiting scientist at the World Wide Web Consortium who just happens to be the editor of the HTTP 1.1 specification.

With the cache turned off, some AOL users are getting bad performance, which makes them less likely to look at the Web pages week after week. Even worse, Web sites must now respond to every single AOL user on those cacheless proxies who makes a request.

A better solution would be to make the HTTP protocol that's used to download Web pages more cache-friendly. A Web server could let a caching proxy keep a copy of a Web page if the proxy server promised, in turn, to tell the Web server the number of hits it received for the page over a reasonable time period.
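To make the bargain concrete, here is a minimal sketch of the proxy's side of it: count how many readers were served from the cache, then hand the tally back to the Web site. Everything here is hypothetical. The class HitCountingCache and its methods are invented names, and no reporting format is implied; the sketch only shows the bookkeeping such a promise would require.

```python
from collections import Counter
from typing import Dict


class HitCountingCache:
    """Hypothetical cache that tallies how often each cached page is served."""

    def __init__(self) -> None:
        self._pages: Dict[str, bytes] = {}
        self._hits: Counter = Counter()

    def store(self, url: str, body: bytes) -> None:
        # Keep a copy of a page the origin site allowed us to cache.
        self._pages[url] = body

    def serve(self, url: str) -> bytes:
        # Serve from the cache and remember that one more reader saw the page.
        # Raises KeyError if the page was never stored.
        body = self._pages[url]
        self._hits[url] += 1
        return body

    def report_and_reset(self) -> Dict[str, int]:
        # What the proxy would send back to the Web site at the end of the
        # agreed reporting period; counters then start over.
        report = dict(self._hits)
        self._hits.clear()
        return report


cache = HitCountingCache()
cache.store("http://example.com/news.html", b"<html>today's news</html>")
for _ in range(200):
    cache.serve("http://example.com/news.html")
report = cache.report_and_reset()
print(report)
```

After 200 cached deliveries the report credits the page with 200 hits, even though the origin site was contacted only once to fill the cache.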

HTTP 1.1 has increased support for intelligent caching, including a new Cache-Control: header which lets caching proxy servers do something more intelligent with pages they want to cache.

The Cache-Control: header is quite flexible for controlling caching proxy servers, and used properly it could go a long way toward solving the underlying problem. That's because the Cache-Control: header lets you individually mark the caching attributes for anything you might download over the Web.

If you're delivering a personalized newspaper to a user, you might put in a Cache-Control: private header. This indicates that the file is for a single user and must not be cached for general access. You might use Cache-Control: public in the big GIFs, JPEGs, and Java applets that are downloaded. No reason not to cache them locally. Finally, you might put a Cache-Control: no-cache in those all-important advertisements that are downloaded. At least that way, you'll be able to give your advertisers some meaningful statistics.
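That three-way policy is easy to express in code. The sketch below picks a Cache-Control value for each response; the function name cache_control_for, its parameters, and the list of "big static" content types are my own assumptions, and returning None (send no header) for everything else is just one possible default.

```python
from typing import Optional

# Illustrative content types for large static files; adjust to taste.
STATIC_TYPES = {"image/gif", "image/jpeg", "application/java-archive"}


def cache_control_for(
    content_type: str,
    personalized: bool = False,
    is_ad: bool = False,
) -> Optional[str]:
    """Return a Cache-Control value following the policy described above."""
    if is_ad:
        return "no-cache"    # advertisements: every view should reach the site
    if personalized:
        return "private"     # one user's page: not for a shared cache
    if content_type in STATIC_TYPES:
        return "public"      # big GIFs, JPEGs, applets: cache freely
    return None              # assumption: leave everything else to defaults


print(cache_control_for("text/html", personalized=True))
print(cache_control_for("image/gif"))
print(cache_control_for("image/gif", is_ad=True))
```

The ad check comes first on purpose: an advertisement that happens to be a GIF should still be marked no-cache, not public.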

You can download the entire HTTP/1.1 draft from the W3C site. It makes for fascinating reading if you're a protocol engineer. (I must have been one in a previous life.)

But HTTP 1.1 doesn't have any mechanism for reporting how many hits a particular Web page receives. That's because people are arguing about what information that report should contain.

"Advertisers would like to know everything, including your mother's maiden name if they can get it," says Gettys. Instead, the next specification will probably just return hit-count information. That's not really sufficient for advertisers: They would like to know where the hits are coming from. At the very least, an advertiser might like to know whether those 200 hits all came from the same user, from 20, or from 200.

Ironically, on the day I wrote this article Netscape announced its new Netscape Proxy Server 2.5 for Unix and Windows NT. The program caches Web pages and scans for viruses at the same time. (It also has some nifty features for violating employee privacy, like keeping a running log of who downloaded which Web page, but that's another issue entirely.)

So while the technology matures, the debate heats up. Perhaps the next version of HTTP can settle the score.