Alex MacGillivray explains the Google Books settlement

John Palfrey introduces Alex MacGillivray as one of the first students of the Berkman Center. He’s now a major player in the world of internet law, formerly serving as Senior Product and Intellectual Property Counsel at Google. As one of the first lawyers to join the company, he represented Google in a number of key cases. Twitter recently “stole” him away, appointing him as General Counsel, which may leave him free to talk about the Google Book Search settlement, the topic of conversation around our lunch table today.

Alex explains that the goal of Google Books was to make books easier to find. He references an article in the New York Times in which librarians lamented that people were only searching online materials, not printed books. He references a story in which a research assistant was asked by Larry Lessig to come back with “everything Senator X said about topic Y” and returned only with results after 1996… which is to say, only results from the web.

There’s a set of books that are “born digital” which are low-hanging fruit, as digital copies already exist. Other new books are easy to obtain and scan. Harder books include ones where rights are unclear, or public domain books, where there simply isn’t very much money available for people to scan them. It was easy, Alex says, for Google to decide to scan the public domain books. And it was easy to decide that it wouldn’t be sufficient just to do the public domain ones – it was essential to partner with people who owned and could provide these books, both publishers and libraries.

If you’re using search and getting a page from a book, you’re probably encountering a book from a publisher who’s partnered with Google. The idea is to make it easier for you to find and decide to purchase the book. If you’re simply getting a snippet, it’s probably a library book where Google doesn’t have a partnership with the rightsholder – the idea is to give you a taste of the book and send you to a publisher or a library to get the book. And in the case of public domain books, you’re able to download the book as a PDF or as text.

So far, Google has scanned more than 10 million books. That quantity of books meant that Google needed to invent a whole new technical apparatus for scanning books… and Google had to physically pick up and scan those books. More than 1.5 million are in the partner program, and a comparable number are in the public domain. There are more than 40 libraries working as partners. And Alex says that lawsuits – two in the US, one in France, and one from Germany which has now been withdrawn – haven’t slowed Google’s pace of scanning books. One US suit is a broad action from a large set of authors; the other comes from five of the six largest US publishers.

Alex notes that Google usually thinks very big – this, he says, is one of the few times that other parties at the table were thinking bigger than Google was. Most parties shared Google’s excitement about making sure that more books got read. And there was widespread consensus about the importance of ensuring that people without means could access this information through libraries.

Under the settlement, any person in the US is able to get search results, locate the book in libraries (via Worldcat) and, if a book is out of print, read 20% of the book, for free. The goal there was to get the benefits that came from the publishing partner program for books that were out of print. Alex remembers checking dozens of books out of a library and scanning through to figure out which would be useful for his thesis – the goal was to create a similar mechanism online. People can also purchase full, permanent access to a book online, priced either by a copyright holder, or through an algorithm that simulates a market for these books. Over 50% of these books are $5.99 or less, while 80% are less than $15.99.

For institutions like Harvard, there are institutional subscriptions that give widespread access to the collection. Institutions pay a fee for access, and then anyone at the institution has access to the books provided. And there’s a third public-access model – any public library in the US will have at least one terminal which can access the subscription materials for free. He points out that there’s a provision for “first-class access” for making books available to people with disabilities, especially for the books that are hardest for people with disabilities to access, like old, out-of-print books.

There are exceptions to the settlement, like the owners of pictures within the books, who are not currently included. Alex encourages people to look up the class definition in the settlement.

Palfrey suggests that Alex talk about orphan works. Alex explains that Google – or at least “the part of Google which is me” – has been fighting for access to orphan works for years. The settlement includes orphan and non-orphan works… though Alex acknowledges that there’s no clear definition of orphan works. He offers the definition: “works where the rightsholder is really, really hard to find.” His colleague, who is responsible for much of the technology of Google Books, points out that there’s another problem – providing access to books that might not be economical to scan and offer online. Revenue for publishers that comes from accessing orphan works goes to the Book Rights Registry, which is charged with searching for publishers to compensate for use of their works. That said, Alex makes clear that the settlement is structured so that if the US resolves the status of orphan works under copyright law, the settlement will be updated to reflect that (presumably better) law.

Unsurprisingly, there are lots of tough questions for Alex and his colleague Dan Clancy, who was technical lead on the project.

Chris Soghoian wonders out loud whether Google’s assurances in this agreement that they won’t be evil are sufficient. Alex pushes back and notes that these aren’t just assurances – they’re legal guarantees. Chris comes to his central concern, which is that “Google has thrown fair use under a bus.”

This contention angers Alex. “We’ve got more current fair use cases than anyone else. It’s convenient to say that we’re against fair use, but it’s bullshit.” Dan offers a more nuanced response, explaining that Google is engaged in several practices – scanning all images without getting releases, scanning unregistered books – that should clearly demonstrate that they believe in and are defending fair use.

Lewis Hyde wonders about the books registry established in the settlement. He points out that the board of directors for the registry comes from publishers and authors while the universe of people who use books includes readers. And he wonders where the money earned from orphaned works will eventually go.

Alex explains that the registry doesn’t make decisions about what users can or can’t do with works, which suggests that perhaps representing users in that process is less appropriate. (It does leave open the question of where users get represented.) As for the money from orphan works – 63% of all revenue goes to publishers, with 37% to Google. (This split is true for subscription revenue as well as for directly purchased works.) With orphaned works, the money goes to the registry in a fund that can be claimed if publishers claim their works. If the money is unclaimed for five years, it can be spent by the registry to seek out rightsholders. At a later point, the remaining money will go towards nonprofit organizations that benefit readers and writers.
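To make the split concrete, here is a minimal sketch of the arithmetic. Only the 63/37 figures come from the talk; the function name `split_revenue` and the rounding rule (round the publisher share to the nearest cent and give Google the remainder) are my own assumptions, not anything stated in the settlement.

```python
from decimal import Decimal, ROUND_HALF_UP

# Share quoted in the talk: 63% to publishers, the remaining 37% to Google.
PUBLISHER_SHARE = Decimal("0.63")


def split_revenue(gross):
    """Return (publisher_amount, google_amount) for a gross sale.

    Assumption for illustration: the publisher share is rounded to the
    nearest cent and Google receives whatever is left over.
    """
    gross = Decimal(str(gross))
    publisher = (gross * PUBLISHER_SHARE).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    google = gross - publisher
    return publisher, google


if __name__ == "__main__":
    # A $5.99 purchase, one of the price points mentioned in the talk.
    publisher, google = split_revenue("5.99")
    print(f"publisher: ${publisher}, Google: ${google}")
```

For an orphan work, the publisher amount in this sketch is what would sit with the registry until someone claims it.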

Lewis objects that this doesn’t answer the key question – is it rightsholders who should actually get paid this money from orphan works? Alex offers that there are lots of ways to settle the question – give the money towards defending and enforcing copyright, towards fighting and overturning copyright, towards universities. The advantage of doing it this way is that it’s likely to encourage more rightsholders to come forward, lessening orphan works problems.

Phil Malone wonders about clauses in the agreement that give Google “most-favored nation status”. If someone else comes in with a deal to do book scanning for less money, Google gets the right to offer an equivalent deal, which takes away one of the few advantages anyone would have in competing with Google. How can this be justified under antitrust law?

Alex essentially acknowledges that this clause gives Google a competitive advantage. “We believe this is such a good idea, we believe others will copy it. This makes it very easy for a second entrant,” he says, as Google has already negotiated and published the terms of a settlement with publishers and authors. He explains that Google asked for this clause to last ten years because that is a fairly common term under antitrust law.

I asked a question about access to the Google data as a whole set, not as individual texts. I’m interested in the utility of these millions of digital documents for lexicographers, or for machine translation researchers. I pointed out that Google probably already has the largest set of translated texts in the world, which is a key step in creating parallel corpora, essential for machine translation. Dan points out two critical uses of corpora I didn’t mention – optical character recognition research, and research on new document search techniques.

The answer to the question: “nobody in the current world” has access to the complete database (other than Google). But that’s changing. “Because of copyright liability, we don’t open the database to everyone else… Library partners have only subsets of that information – Stanford doesn’t get Harvard’s works.” That said, there’s a provision for two research centers at each partner institution which will have access to the entire corpus for “non-consumptive, textual, computational analysis.” (“Non-consumptive” means “for uses other than reading and understanding the work,” including the text analysis examples offered above.)
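As a rough illustration of what “non-consumptive” analysis could look like, here is a minimal sketch that computes word frequencies over a set of texts without ever displaying the texts themselves. This is my own toy example, not anything Google or the settlement specifies; the function name `term_frequencies` and the two sample sentences are made up for illustration.

```python
import re
from collections import Counter


def term_frequencies(texts):
    """Count how often each word appears across an iterable of texts.

    Only aggregate counts are returned; the texts themselves are never
    shown to a reader, which is the spirit of non-consumptive use.
    """
    counts = Counter()
    for text in texts:
        # Lowercase, then pull out runs of letters and apostrophes as words.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


if __name__ == "__main__":
    # Hypothetical stand-ins for scanned pages.
    sample = [
        "The senator spoke about trade.",
        "Trade was the topic the senator chose.",
    ]
    freqs = term_frequencies(sample)
    print(freqs.most_common(3))
```

A lexicographer would want something far richer than raw counts, but the shape is the same: the corpus goes in, statistics come out, and nobody reads the books.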

Google wants to see these centers created, Dan says, and has put up $5 million to establish them. But it’s going to be up to the libraries to run them and determine whether researchers are creating appropriate uses. Universities could sponsor any researcher – including those outside the university – to access that corpus.

No one has yet built one of these centers, in part because this access is part of the settlement, which is still being challenged in court. But Dan makes clear that Google can make no intellectual property claims against people who build new search or translation algorithms trained on the data.

I asked the question because, at the recent Open Translation Tools summit, I grew concerned that innovators might stop working on translation projects, believing that Google now has an insurmountable lead in machine translation and will obviate other efforts in the near future. I don’t believe that’s true and I think it would be a mistake for people to stop working on machine translation and other strategies.

What’s exciting is that some researchers will be able to access this huge corpus if the settlement goes through – that’s good news for projects like Wordnik, as well as for anyone researching search or translation. What’s a little worrisome is that Google partners like Harvard may have liability concerns that restrict access to the corpus – if Harvard is worried about liability from me potentially releasing sensitive material, they may restrict access to this critical corpus. One path forward might be approaching Harvard – which already has access to a huge corpus of material that Google has scanned for the Harvard library – and seeing whether it’s possible to build machine translation corpora from this existing data. I think that might be something I add to my to-do list.

4 Responses to Alex MacGillivray explains the Google Books settlement

This is fascinating – thanks for blogging the discussion in such fine-grained detail.

I wonder if your conclusion that “Google [doesn’t] have an insurmountable lead in machine translation and [so won’t] obviate other efforts in the near future” follows from what he said. It’s fantastic that the entire corpus will be made available for computational analysis by research centers at partner libraries, but — would they be able to do anything other than pure research with it? Would anyone, for example, be able to set up a for-profit translation business using the corpus? If not, that still leaves Google with a huge competitive advantage.

Darius, sorry the post was unclear on that point. Researchers will own their IP when working with the Google corpus. That means that it would be possible to build a machine translation company at Harvard using the Google corpus, spin it out and monetize it without owing any IP to Google.

There’s another question, which is whether it’s reasonable for academic researchers to try to compete with Google’s in-house team on the subject of machine translation. I think it is, and I think that deciding not to compete with Google will guarantee fulfillment of a prophecy that Google will dominate machine translation in the future… :-)