D-Lib Magazine

December 2006

Volume 12 Number 12

ISSN 1082-9873

(This Opinion piece presents the opinions of the author. It does not necessarily reflect the views of D-Lib Magazine or its publisher, the Corporation for National Research Initiatives.)

Abstract

In October the University of Chicago Press published Google and the Myth of Universal Knowledge: A View from Europe, by Jean-Noël Jeanneney, President of the Bibliothèque nationale de France [1]. English-speaking readers should take what Jeanneney has to say seriously (as well as the critique offered in the Foreword by Ian Wilson, Librarian and Archivist of Canada), both because it resonates with European cultural politics (and has succeeded to a significant extent in motivating a movement to digitize European print heritage) and because much of his case against Google Book Search serving as a building block of the digital library of the future is, in fact, compelling [2].

Barely a month after the Google announcement on December 14, 2004, that it would digitize the "world's knowledge" (15 million books from five major research libraries in the US and UK), Jean-Noël Jeanneney responded with "When Google Challenges Europe" on the editorial pages of Le Monde [3]. His 'cri du coeur' was expanded into this book, published as Quand Google défie l'Europe: Plaidoyer pour un sursaut, in April 2005, not least because by mid-March President Chirac had endorsed his views and called for France to take European leadership in book digitization. As is evident from the chronology of policy, funding and digitization initiatives in the updated Introduction to this English-language edition, Jeanneney struck a chord not just in France or Europe, but worldwide, and his "call to arms" has engendered a dramatic increase in international collaboration in the digitization of books by governments, to counter what he identifies as the risks posed by Google.

So what is the basis for Jeanneney's opposition to Google's undertaking? It is definitely not Luddite; Jeanneney presides over and takes pride in one of the largest book digitization programs in the world at the BNF, which presents its public Web face at <http://gallica.bnf.fr/>. By expressing his opposition to Google's approach, Jeanneney was instrumental in mobilizing France and the national librarians of Europe to dramatically increase their book digitization, launch the European Digital Library, and fund the creation of a European R&D program for search technologies. And Jeanneney is not anti-Google per se; he expresses his admiration for Google's founders and its success, though he finds its methods in Google Book Search inimical to his notion of culture.

I will not attempt to portray the full range of Jeanneney's ideas on America, Adam Smith and market economics, the nature of European culture, and the many other matters that enter into his very personal essay. Instead I'll focus on five distinct critiques Jeanneney weaves through his text, which I believe are central to his message and important to anyone engaged with digital library policy. First, Google will not be able to digitize everything ever printed, and its selection will favor American, or English-language, sources over other cultures. Secondly, Google's presentation of texts based on keywords decontextualizes them in culturally damaging ways, and its primary interest in harvesting words to link to advertising permits sloppy imaging of the books at the expense of more carefully executed efforts. Thirdly, the Google search engine (and its business plan) promotes search results that are not consistent with the rankings that scholars from the cultures in which the literature was written would approve. Fourthly, permitting a private firm to own the digital library of images and OCR'd texts is not a sound archival plan for the world's libraries or cultures, and defeats efforts to encourage value-added exploitation of this unique resource. And fifthly, Google's approach to copyright threatens the achievement of a universal digital library. I suggest taking these criticisms one at a time.

1. Selection

Jeanneney complains both that Google is digitizing from American (or British) libraries and hence predominantly from English language sources at the expense of French, other European languages and languages of the rest of the world, and that the method of selection being employed is culturally insensitive. He estimates the Western world's printed output at 100 million volumes and makes no guesses for the rest of the world, so Google's initial announcement of plans to digitize 15 million volumes over six years, breathtakingly large as it is, can be read as a selection of 15% of Western published literature. Google could say that it plans to digitize the rest as soon as possible, but Jeanneney might reasonably respond that if so, we would still be selecting what will be digitized in the next generation. Meanwhile Google continues to announce contracts to digitize from American university libraries. This helps to explain why Google's announcement was greeted with less excitement in Europe. It also reveals why those who care about internationalism are worried; as Ian Wilson explains in his Foreword, Google's activities violate the letter and spirit of the UNESCO 2001 Universal Declaration on Cultural Diversity [4].

Whether or not we accept Jeanneney's assertion that the scholars of each nation should be selecting the books to be digitized (an approach that at least doubles the cost of digitization projects over that of digitizing libraries end to end), two important points must be accepted from his argument. First, Google is privileging American, and English-language, sources with results that provide lovely anecdotes for Jeanneney to exploit [5]. Secondly, the principles of cultural diversity expressed in the UNESCO Declaration are and will be undermined on the Web by the actions of Google. We may think this is not terribly important, or consider it unfortunate but inevitable. We might believe that even if Jeanneney's vision were realized, the consequence in the next generation would be selection favoring rich countries over poor. Still we must concede the point: the poorest and most vulnerable cultures will be left out, at least for a long time.

2. Presentation

Over the past two years, Google has adopted downloadable PDFs for out-of-copyright volumes, a more useful approach to displaying the results of its digitization than it originally announced. But the fact that images of books digitized under the Google Book Search project are now visible should increase the concern of librarians and scholars. The quality of the scans that have been made public is so poor that one could plausibly argue that they are part of Google's defense against copyright infringement, supporting the claim that the use made by automatic indexing is fundamentally different from making a copy. Google's indifference to showing books with pages that are cut off, photographed with hands over them, filmed at skewed angles, or missing altogether supports Jeanneney's case that it is dangerous to put this process in private hands whose business model is something other than the preservation of culture [6].

The second element in Jeanneney's critique, that the presentation of snippets, or user search keywords in the context of page images from the books, decontextualizes them, echoes a critique made by ALA President Michael Gorman when Google first announced its plans [7], but goes beyond Gorman's criticism. Jeanneney is interested not simply in the context within the book, but in the context of the book in culture, in sophisticated ways that are worth examining. Jeanneney leads with a superficial example: searching for Cervantes in Google Book Search at the time he finished this text produced translations in English and German before the ninth reference to a work in Spanish. This example makes a point about the fact that books exist in a vernacular tradition as part of a dialogue [8]. As Ian Wilson notes, in correctly identifying this critique at the heart of "fundamental cultural policy issues implicit in Google's massive ambition", "complex cultural nuances are forced into molds and structures built by and appropriate to one dominant cultural perspective." Jeanneney would clearly prefer that the presentation of search results contextualized works as a curator or teacher would.

3. Results ranking

Any search through billions of pages of published texts will need to show some results first and others so far down in the results set that few if any users will ever find them. Jeanneney, with a certainty born of French academic elitism, but often reflected in Anglo-American library cataloguing too, has tremendous confidence in the power of some imagined intellectual tool to classify knowledge with less bias than Google (which he suspects of showing us the "gondola end", where the goods it offers for sale are displayed). Since Google will not make its algorithms public, Jeanneney's suspicions fall on fertile ground.

Jeanneney confidently proclaims that "the enemy is clear: massive amounts of disorganized information". He articulates "the hope that the web, on a global level, might reduce inequalities of knowledge", opining that "disorganized information, however, if it dominates, will actually increase those inequalities" [9]. While we might challenge these assertions and doubt that an organization of knowledge accepted by Jeanneney would be any less culturally biased, it is likely that any bias acceptable to Jeanneney would be one that has more appeal to European policy makers than has the vision presented by Google. Jeanneney advocates the governmental funding of a European Search Engine to compete with Google. There is an element of pure European pride here, like the European Space Agency decision to fund its own GPS system, but at least in this case one can reasonably argue that search results would be different. We need not believe that this search engine would produce better results to support Jeanneney in advocating that the algorithms governing search and ranking be open for inspection, critique and improvement. We cannot help being uncomfortable, to some extent, with Google's assertions that its rankings are uninfluenced by commerce, or with results that are so clearly linguistically biased [10].

4. The Longer Term

Jeanneney's critique of Google as an archival trustee is not about Google per se, but about the market and any commercial entity. Noting that Netscape seemed invincible before the advent of Microsoft IE, but disappeared as a company within years, he asks rhetorically what happens if Google is split up in a monopolies decision, implodes in the market or is sold. Perhaps we should ask: what if it is sold to the Chinese, who might decide to limit content served to us as Google has recently agreed to limit that served to Chinese citizens? Is it acceptable that we are permitting a private firm to have an unknown degree of control over the future of digitized publishing culture? The recent contract with the University of California, for example, supports Jeanneney's concerns, since even the insolvency of Google would appear not to release the university from its limited rights in the data [11] and the selection of 2.5 million books to be digitized rests entirely with Google. As a matter of public policy, these kinds of agreements would be insupportable in Europe; perhaps they should not be acceptable in North America either?

We should be at least equally concerned that Google's digitization restricts access to these works for value-added use by others [12]. Jeanneney does not argue that Google, or any other private company, should be obliged to subsidize its future competitors by making scans available for their reuse, but rather that publicly funded institutions and major research libraries of private universities should ensure that future scholarship is enhanced by as wide a range of value-added reuses as possible and, implicitly, ought not to contract with entities that are not committed to that end. None of the contracts with Google made public thus far permits value-added reuse of scans made for Google in products and services offered by others. This is why the Open Content Alliance embraced mass downloading of its holdings by any commercial user [13] and why I argued at the UNESCO meetings in St. Petersburg last year for endorsement of "open source print image" as a baseline for all public digitization programmes [14]. Our present OCR methods, and techniques for mining knowledge from printed texts, are certain to be improved upon in the future; if there is not ongoing access to the corpus of digital source by potential re-users, we face the absurd prospect of having to re-digitize from printed copies for every subsequent reuse.

5. Copyright

The storm that resulted in lawsuits against Google by authors and publishers in the United States in 2005 was only gathering when Jeanneney's book first appeared in France, but the attitude that Google took to copyright was obviously not acceptable in Europe, and the disrespect to authors illustrated by Google's actual digitization since then is inconsistent with European, and especially French, notions of the moral rights of authors. Viewing Google's approach as a probably fatal misstep, Jeanneney asserts only half in jest that "whatever it does, we should do the opposite" [15].

No public body in Europe could do other than engage in a discussion with authors and publishers to arrive at mutually acceptable terms under which to digitize its print heritage. Jeanneney is confident that doing so will also give the European Digital Library and projects undertaken by his library firmer foundations and a capacity to continue and thrive. Whatever view we may take of the likely legal outcome of the challenge to Google Book Search in the U.S., we should regret that Google made so little effort to find a solution that did not require courts to rule on the question of whether what it is doing violates copyright, because either decision will leave us worse off. If Google loses, all sorts of automated processes for adding value to texts could be foreclosed. If Google wins, we can expect future publishers to include more technical and legal methods of protection that permit the copyright owners to allow or disallow various forms of use, including reading, based on contract and protected by the Digital Millennium Copyright Act (DMCA) and similar legislation.

Conclusions

Jeanneney's book is not without problems [16], but as a tract it shows an admirable precision in homing in on European political sensitivities and, as a consequence, I think raises important issues for all cultural guardians. We should, as does Jeanneney, applaud Google's bravery in tackling the big challenge of digitizing the world's printed literature. His measured appreciation for Project Gutenberg, which began early and has long persisted in providing ever more authoritative full text transcriptions; for the Open Content Alliance, which seeks to balance the intellectual property interests of owners with the desires of users for access to content; and for the Million Books project, which (albeit for some base economic reasons) is digitizing huge numbers of Chinese and Indian publications, reminds us that even in America we have alternatives. And his account of the emerging European Digital Library, European search engines and model library digitization endeavors from libraries throughout the EU, though slightly self-serving, demonstrates that other models are being realized, in part in opposition to Google.

In the glare of publicity surrounding Google Book Search and other mass digitization projects focused on print culture, we should not lose sight of the small proportion of culture that publication represents, the problems of ceding its control to a private firm, Google's unfortunately incendiary approach to intellectual property, the poor quality of the digital capture we have seen to date, the limits of search and presentation as performed in this one service and the restriction that Google applies to other potential value-added uses, or the significant problem of cultural bias exacerbated by Google's advertising business model. Ian Wilson calls our attention to five principles enumerated by national librarians of la francophonie meeting in Paris on February 28, 2006: free access to publicly owned resources; non-exclusive agreements with content providers; capture of preservation standard images with assurances for long-term accessibility; protection of the integrity of original source materials; and provision of multi-lingual, multi-cultural access [17]. Jean-Noël Jeanneney has done us all a service by reminding us to look under the hood and hold Google, and those providing content to it, accountable. In the two years since Google first announced its ambitions, I think the D-Lib community has largely given Google the benefit of the doubt; now that some results are visible and the implications are more clear, I think it's time to publicly endorse open access to rights-cleared, high quality, scanned page images and reconsider the appropriate roles for academic and public institutions participating in commercial analogue heritage conversion efforts that don't contribute to this end.

[5] On p.11, Jeanneney recounts searching Google Books in May 2005 for Victor Hugo, Dante, Cervantes and Goethe and finding only English language translations of the literary titans of each of these major European languages.

[6] In the case of the "Histoire de la Revolution de France" retrieved in my search (in fn 10 below), I got only as far as page 23 in a poorly scanned book before reaching the first page that was completely illegible because the operator had moved the page during the scanning process. A random sample persuaded me that the problem is widespread, but it probably isn't culturally biased. Google's scanning process seems to be an equal opportunity book defacer: looking to see if important American authors were better treated, I randomly selected Emerson's Essays digitized by Google from a Harvard copy and found that more than 5% of the pages were partly illegible to a person and probably more than 10% would be partly illegible in OCR.

[7] Gorman, Michael, LA Times, December 17, 2004: "books are designed to be read sequentially and cumulatively".

[8] This search, repeated by the author on October 21, 2006, for "Miguel de Cervantes" produced the first Spanish language text in the third screen, so it may be that as Google Book Search grows the bias identified by proponents of cultural diversity increases!

[9] Google and the Myth of Universal Knowledge: A View from Europe, p.70-71.

[10] My search, October 20, 2006, for 'france revolution' (a search that could be the same in French or English) demonstrated that Google makes the link to "French revolution" easily, but produced nothing in the French language until the 239th hit, when it showed me p.37 of "Histoire de la Revolution de France" (by A.F.Bertrand de Moleville, 1803), which actually does not include the words I used in my query, though they appear on the two title pages and some other pages leading up to p.37. Using the correctly accented form of the word in French, "révolution", produced the first reference in French on the 6th page, preceded by more than 50 English-language books that did not use the term with an accent at all!

[12] Download a full-text from Google and you get a cover page trying to justify why they have placed technical restrictions on automated querying and restrict you to personal, non-commercial uses and to including their "watermark" on each page.

[13] The first principle for participation is "The OCA will encourage the greatest possible degree of access to and reuse of collections in the archive, while respecting the rights of content owners and contributors." <http://www.opencontentalliance.org/participate.html>.

[14] The St. Petersburg working group made this recommendation, and several others advocated in David Bearman and Jennifer Trant, "Converting Scanned Images of the Print History of the World to Knowledge: A Reference Model and Research Strategy", Russian Digital Libraries Journal, 2005, Vol. 8 # 5 <http://www.elbib.ru/index.phtml?page=elbib/eng/journal/2005/part5/BT>, to the World Summit on the Information Society in Tunis last summer, but no action was taken.

[15] Google and the Myth of Universal Knowledge: A View from Europe, p.80.

[16] American readers will be jarred by the segue from the Million Books project at Carnegie Mellon University in Pittsburgh to the unrelated Mellon Foundation in New York, p.37. In addition, the book lacks the benefit of an index and suffers from poor translation of important technical distinctions, further confusing Jeanneney's already problematic discussion of digital imaging, scanning and OCR on p.54, for example.

[17] Google and the Myth of Universal Knowledge: A View from Europe, p. xii.