Tuesday, July 12, 2005

The Way Back Machine and Robots.txt

On July 8th, a complaint was filed in the United States District Court for the Eastern District of Pennsylvania, Healthcare Advocates, Inc. v. Harding, Early, Follmer & Frailey, et al. This is such an extraordinary document that I will break with my usual practice of not commenting on complaints or motions. Those who decry the DMCA as an (attempted) tool of oppression will find more than ample support in this effort. Other laws are implicated too, including some I venture to guess most IP lawyers have never heard of, at least in the IP context: for example, a Greta Garbo-like claim for "Intrusion upon Seclusion." Others, such as the Computer Fraud & Abuse Act and trespass to chattels, have become better known recently but are invoked here in a novel way, to say the least. In my opinion (and all of this is opinion, whether denominated as such or not), the Healthcare Advocates complaint represents a misuse of the legal process.

The complaint appears to be the result of an earlier failed suit brought by Kevin Flynn and Healthcare Advocates (Flynn is the President) against Health Advocate, Inc. and others for various trademark and related claims. Three opinions in that case should be noted: 2004 U.S. Dist. LEXIS 293 (E.D. Pa. January 13, 2004) (dismissing a number of claims), 2004 U.S. Dist. LEXIS 12536 (E.D. Pa. July 8, 2004) (denying plaintiff's motion to amend complaint and denying defendant's motion for in camera review of the documents in question), and 2005 U.S. Dist. LEXIS 1704 (E.D. Pa. Feb. 8, 2005) (dismissing remaining federal claims and declining to exercise pendent jurisdiction over state fraud claim).

During the investigation of plaintiff's claims, a law firm for some of the defendants utilized the not-for-profit Internet Archive Wayback Machine. The Wayback Machine lets one access archived versions of websites. You type in the URL, select a date range, and presto, you can surf an archived version of the web page in question. It is a phenomenally important archive, useful to people throughout the world, including parties in lawsuits who want to find out what their adversary was saying in the past on a website that has been updated or revised potentially hundreds of times since the events in question. The Wayback Machine contains about 1 petabyte of data, more than that in the Library of Congress, even though the archiving only began in 1996. The archiving is accomplished by the Alexa webcrawler.
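The "type in the URL, select a date" workflow described above maps onto a simple URL scheme. A minimal sketch of how a snapshot request is formed follows; the web.archive.org path pattern is real, but the timestamp and target site below are illustrative examples only.

```python
# Sketch of how a Wayback Machine snapshot URL is formed. The URL
# pattern is real; the specific date and target are just examples.
def wayback_url(timestamp: str, target: str) -> str:
    """Build an archive URL for `target` as of `timestamp`
    (YYYYMMDDhhmmss; shorter date prefixes are also accepted)."""
    return f"http://web.archive.org/web/{timestamp}/{target}"

# For instance, a June 2003 snapshot of the plaintiff's site
# would be requested at an address of this shape:
print(wayback_url("20030601", "http://www.healthcareadvocates.com/"))
```

Requesting such a URL is what the defendant law firm is alleged to have done, repeatedly, in July 2003.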

The Wayback Machine is not, however, interested in archiving material website administrators don't want archived, so it has developed a number of ways for people to say, "Please don't collect our stuff." You could telephone the Internet Archive and tell them not to. Or, you can utilize the SRE (Standard for Robot Exclusion) to specify files or directories that cannot be crawled. This is accomplished by a file called robots.txt. (Here is a short article on the Wayback Machine and robot exclusion from Wikipedia, and here is a more technical explanation, Robots.txt.) Use of robots.txt is entirely voluntary and many webcrawlers do not utilize it, although the Alexa webcrawler is programmed to obey the robots.txt instructions, and in fact is constructed so as to block, retroactively, files in existence before the instructions were inserted.
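The voluntary nature of the protocol is easy to see in code. Here is a minimal sketch using Python's standard-library robots.txt parser; the file contents mirror a blanket exclusion of the kind at issue in this case, and the crawler name is only an example.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical blanket-exclusion robots.txt, of the kind
# Healthcare Advocates inserted on July 8, 2003.
robots_lines = [
    "User-agent: *",   # applies to every crawler
    "Disallow: /",     # please do not crawl anything on this site
]

parser = RobotFileParser()
parser.parse(robots_lines)

# A well-behaved crawler asks first and obeys the answer:
allowed = parser.can_fetch("ia_archiver",
                           "http://www.healthcareadvocates.com/about.html")
print(allowed)  # False: the polite crawler stays out

# Nothing enforces this, though. A crawler that simply never calls
# can_fetch() fetches the page anyway; compliance is purely voluntary.
```

The whole "protection" consists of the crawler choosing to consult the file, which is the nub of the argument that robots.txt cannot be an effective technological measure.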

Back to the Healthcare Advocates case. The complaint in the earlier suit against Health Advocate, Inc. was filed on June 26, 2003. Healthcare Advocates had been operating a website, www.healthcareadvocates.com, since 1998. On July 8, 2003, the robots.txt instructions were inserted. The next day, it is alleged, defendant's law firm tried to access archived Healthcare Advocates website material. In the court's July 8, 2004 opinion, an allegation is recited that between July 8, 2003 and July 15, 2003, 849 attempts were made to access the archived information, of which about 112 were successful. Presumably, all of the material was pre-July 8, 2003 information.

Plaintiff sought to amend the complaint to bring claims against the law firm for this activity, but the court denied the motion. After plaintiff's complaint was dismissed as noted above, this new complaint against the law firm, its members and employees, and the Internet Archive was brought last Friday, July 8th.

There are 12 counts, too many to recite on this already too long blog. I will only talk about one, the DMCA claim, an alleged violation of Section 1201(a): "No person shall circumvent a technological measure that effectively controls access to a work protected under this title." It is alleged that the robots.txt denial text string is a technological circumvention measure and that defendant law firm circumvented it. This claim, in my opinion, is factually and legally wrong. Factually, at least from the complaint, it does not appear that the law firm "circumvented" anything, if by circumvent we mean devised a mousetrap to bypass the denial text string. Instead, it seems as if defendant kept banging on the URL until, for whatever reason, the denial failed to be recognized. This is like going down a row of houses and trying doors to see if they are open. If they aren't you move on until you find one that is. If it is open you walk in, but you certainly haven't circumvented an access control mechanism.

But just as importantly, I don't see how robots.txt can meet the 1201(b)(2)(B) definition of a technological measure: it is a voluntary protocol, operated, if at all, not by the copyright owner but by a third party, and not all third parties have agreed to use it. The definition of a technological measure is one that "effectively protects a right of a copyright owner ... if the measure, in the ordinary course of its operation, prevents, restricts, or otherwise limits the exercise of a right of the copyright owner under this title."

In the ordinary course of the operation of plaintiff's website only those webcrawlers that had voluntarily agreed to do so would restrict access, and many don't. That can hardly meet the effective protection standard contemplated in the definition. And as a policy matter, plaintiff's theory would encourage good government archivists like the Internet Archive not to use voluntary measures on pain of a DMCA violation. Nor can one say that there was any quid pro quo here: the webpages in question were publicly available long before plaintiff decided to restrict access in conjunction with a much later filed lawsuit. And that is the worst policy of all.

49 comments:

I'd like to suggest a different interpretation of their DMCA claim (while acknowledging that the complaint is not clear): that the robots.txt file operates as a TPM as used by the Internet Archive. The standing provisions of the DMCA have been interpreted broadly, so perhaps the plaintiff here is arguing that the Internet Archive has implemented a TPM that controls access to its archived materials. The robots.txt file is intended to block external access to these materials, and was bypassed by the defendants. (I'll admit, this sounds like the Archive's claim to bring, not the plaintiff's, but the DMCA's standing provision has been stretched before.)

I think the claim still fails for the other reason you note. But I don't think the complaint need be construed as arguing that robots.txt is a TPM generally.

I have a pet peeve on an analogy you used and that is frequently used by others when dealing with Internet security issues: "This is like going down a row of houses and trying doors to see if they are open. If they aren't you move on until you find one that is. If it is open you walk in, but you certainly haven't circumvented an access control mechanism."

Almost every state considers the unprivileged opening of an unlocked front door to be an unlawful entry. It is most certainly a trespass. The point of unlawful entry or trespass occurs at the moment the door is opened and the threshold is crossed. The whole purpose of a house, locked or not, is to protect the contents and to act as an "access control mechanism." The DMCA anti-circumvention provisions (conceptually a stupid idea for a copyright act; you might as well bootstrap all contractual disputes with respect to a copyrighted work into the copyright law, too) are designed to protect copyrighted works when the owner puts them inside a house, so to speak. Even leaving the door ajar doesn't suggest that anyone can come in and help themselves to the silverware.

To make a somewhat different version of Fred's point: I see the complaint as arguing that plaintiff's robots.txt file was a TPM because, in the normal course of affairs, it prevented people from accessing plaintiff's content via the Internet Archive, and therefore it "in the ordinary course of its operation . . . limit[ed]" an exercise of a right of the copyright holder.

Oops -- like Bill, I quoted 1201(b) rather than 1201(a). The relevant definition of a TPM is a measure that "in the ordinary course of its operation, requires the application of information, or a process or a treatment, with the authority of the copyright owner, to gain access to the work." The theory of the complaint, I think, is that the robots.txt file does that because it normally precludes access via the Internet Archive. But does that hold together?

How can a convention - adherence to which is optional, by definition - be considered a protection measure? Isn't that rather like having a lock with a lever that says 'open me', as well as having a hole for a key?

The problem seems to be the retroactive hiding of pages by the Internet Archive, something that isn't specified by the robots.txt definition. The normal definition of blocking access via robots.txt is that *future* crawls will skip those pages, not that all previous references to that page are deleted. Why the Internet Archive takes this additional step is beyond me.

In any case, it is hardly an effective TPM, since they could have gotten the page from the Google cache, which doesn't retroactively delete.

I hope that this legal action sets a precedent on this matter, so that future lawsuits of this nature are harder to bring.

There is a text version of the complaint at: http://www.ip-wars.net/story/2005/7/12/185442/034

On Fred von Lohmann's and Jon Weinberg's point, isn't that reducing the definition to the point of absurdity: so long as something is a TPM for one person, the DMCA applies. And then, of course, it will always be effective for that one person. I had assumed, whether rightly or not, that Congress was referring to a TPM of general application.

BTW, when still working on the Hill, a company that shall go nameless but that provides anti-circumvention protection for the motion picture industry asked us to put an anti-circumvention measure in the GATT. We said, but don't you have a patent? They said yes. We said, and isn't what you're complaining about (at that time) an infringement of your patent? They said yes. We said, so why not sue for patent infringement? They said it's too expensive. It was a nice lesson that some in the private sector view Congress as a cheap alternative to patent litigation.

I don't care for the complaint as I think that they got it legally and technically wrong.

That said, however, I think that there is something important that is often overlooked in these archiving schemes which does not sit right. Under the WBM's terms of use, the author has to opt out, not opt in. Whether you like the DMCA or not, that doesn't sound like traditional copyright at all. The WBM isn't just excerpting sections, it is copying everything verbatim and redistributing it. Worse yet, it may be "taking" content and authors may not even know it.

About 5 years ago, I litigated in the SDNY and Second Circuit Register.com v. Verio (for plaintiff). Opt-in versus opt-out was a big issue for terms of use and privacy in that case.

With publicly available websites (by which I mean sites not protected by passwords or otherwise), though, it seems there should be a healthy implied license, understanding that the license will be defined by a number of things, like custom. I would hope that at least something like the Internet Archive would fall within such a license. Amendment to 17 USC 108 is another option.

If the WBM is relying on 17 USC 108, they're going to run into problems regarding copying. While a library or an archive in the traditional sense could conceivably avail themselves of such a thing and then make it open to the public, it's difficult to see how a digital archive like archive.org may equally be safe harbored. The language states pretty plainly that it protects the archiver when reproducing "no more than one copy or phonorecord of a work". We know from copyright law that computers make many copies in the process. This is especially true for a site that then serves it to the public (retrieving a page once by one client then creates two copies: one on archive.org and the other in the user's cache).

I agree with mmmbeer that 108 doesn't cover everything the Internet Archive is doing; my point was that perhaps an amendment to 108 might be an approach. And with such a proposal, we could have a good public debate with policy makers about what types of uses we should permit.

Circumvention of WHAT? A robots.txt file is merely netiquette, a means of asking the bots to "please leave this one alone." While most of the major search engines' crawlers abide by this, there are plenty of others that do not. It was never understood that obeying a robots file was mandatory. Will we go down the DMCA slippery slope and see actions against anyone whose crawler "circumvents" someone's robots file? Give me a break.

kevin - For starters, that's why I think that the complaint is technically (as in technologically) deficient. But, the more I think about it, the more I think that what archive.org does IS probably wrong.

Conceptually, archive.org's policy seems to place the "burden" on the wrong party. The burden, as I understand it, shouldn't be on the copyright holder to do anything more to prevent the wholesale, exact copying and distribution of everything on a site (adding a robots.txt or calling them or e-mailing them), except insofar as litigation is required. We, in fact, would expect parties to engage in similar "real world" behavior to obtain either consent or license. As noted above, 17 USC has a number of exceptions, but they all seem deficient or simply not applicable.

As Mr. Patry suggested, archive.org might get a pass on an implied license of some sort (indeed, one could point to a number of sites that do similar things: Google Images, etc.). However, I'm not sure that even an implied license would go so far as to permit wholesale copying of everything on a site, in repetition, with unlimited reproduction and distribution rights. That seems a bit beyond what someone might expect another has the rights to do with their property. At least, it would seem that a good lawyer could poke holes in any such defense pretty quickly. More plausible might be a judge's more liberal reading of archiver exceptions.

I agree with Kevin Brady on the netiquette remark for the DMCA claim, and I also agree with mmmbeer with his concerns about the copyright issue: one can, after all, defeat an implied license defense to copyright infringement just by saying "No more from now on out" and that is what the denial string request was. There remains, of course, a fair use defense.

What you and Mr. Patry said about an implied license sounds like a good idea. Be interesting to see how the court interprets this case and if they apply such a concept. However, I think your idea of an "archiver exception" is better. There needs to be some definitive fair use coverage for this sort of thing. If plaintiff's counsel can poke holes in an implied licensing defense, and the courts buy into it, a lot of search engines will be in big trouble, as an adverse ruling could also impact their ability to cache web pages (which is also the wholesale copying of sites).

I think of this as somewhat like billboards placed along the highway. We can all look at them as we drive by, just like we can surf into websites whether they are indexed or not. A robots tag is like saying "please don't photograph my billboard." Will that stop people from taking photos? To some extent, but you can bet that many still will. Then the question becomes, is photographing that billboard actionable? The answer may depend on the use of the taking, and that's where fair use should be looked at. Indexing and caching websites should be an activity that is encompassed within the fair use doctrine, or else the whole utility of search engines is greatly diminished.

The bottom line is: if you don't want crawlers getting into your online content but you want your users to, there is an easy technological cure. Use an image code verification script that requires the user to manually enter a randomly-generated character string. This is commonly used to exclude bots.

First, the photographing-billboards analogy is plainly NOT the same as the slavish, wholesale copying of everything an advertising company or commercial entity has created, which is then reproduced and redistributed without limit. It would be a problem, and I'm sure that Mr. Patry would agree, if you did go around copying every billboard, compiling them, and subsequently making them available for free to everyone on your terms.

Second, I'm not sure that the temporary caching (as in Google's cache) of the latest version of a website really is the same thing either. Most obviously, the cache is temporary, marked up as a cached copy, and does not necessarily contain every element (as the cache usually does not contain stylesheets, images, scripts, etc.).

Finally, again, I'm not sure that making the copyright holder work harder is really the correct burden (as in the making of silly verify-as-human tricks). Moreover, most websites WANT to be found, spidered, and indexed. That seems plainly within a custom-and-usage implied license. That doesn't mean that they want or expect the aforementioned "archiving."


About Me

This is a personal blog, not a Google blog. It is about my book Moral Panics and the Copyright Wars, published by Oxford University Press. Please don't attribute anything in the blog or the book to Google, which employs me.