Google’s count of 130 million books is probably bunk

Google claims to have produced a count of the world's books, but given the …

Google's core Internet search technology famously grew out of a grad school project by Larry Page and Sergey Brin to index the world's books, and the modern Google Book Search (GBS) project actually touts itself as the part of Google that carries on the founders' original vision. So, when GBS, which has thrown high-powered computers, brilliant engineers, and millions of dollars at digitizing the world's books, claims to have come up with a reasonable count of the number of books in the world, who are we to disagree?

"After we exclude serials, we can finally count all the books in the world," wrote Google's Leonid Taycher in a GBS blog post. "There are 129,864,880 of them. At least until Sunday."

It's a large, official-sounding number, and the explanation for how Google arrived at it involves a number of acronyms and terms that will be unfamiliar to most readers. It's also quite likely to be complete bunk.

The ongoing GBS metadata farce

Google's counting method relies entirely on its enormous metadata collection—almost one billion records—which it winnows down by throwing out duplicates and non-book items like CDs. The result is a book count that's arrived at by a kind of process of elimination. It's not so much that Google starts with a fixed definition of "book" and then combs its records to identify objects with those characteristics; rather, the GBS algorithm seeks to identify everything that is clearly not a book, and to reject all those entries. It also looks for collections of records that all identify the same edition of the same book, but that are, for whatever reason (often a data entry error), listed differently in the different metadata collections that Google subscribes to.
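
To get a feel for the shape of this approach, here's a minimal sketch in Python of what a process-of-elimination count might look like. Everything in it (the record fields, the "not a book" tests, and the exact-match clustering key) is a stand-in of my own invention; Taycher's post describes the general strategy, not the actual code.

```python
# A toy illustration of a process-of-elimination book count.
# The record fields, the "not a book" tests, and the clustering key
# are all hypothetical -- Google has not published its actual rules.

from collections import defaultdict

def looks_like_a_book(record):
    """Reject anything that is clearly not a book (CDs, maps, serials...)."""
    if record.get("format") in {"audio CD", "microform", "map", "video"}:
        return False
    if record.get("is_serial"):          # periodicals, journals, etc.
        return False
    return True

def edition_key(record):
    """Collapse records that describe the same edition of the same work.

    Real matching has to survive data-entry errors, so it would weight
    several fields rather than demand exact equality; this exact-match
    key is only a stand-in for that fuzzier comparison.
    """
    return (
        record.get("title", "").strip().lower(),
        record.get("author", "").strip().lower(),
        record.get("publisher", "").strip().lower(),
        record.get("year"),
    )

def count_books(records):
    clusters = defaultdict(list)
    for rec in filter(looks_like_a_book, records):
        clusters[edition_key(rec)].append(rec)
    return len(clusters)   # one "book" per cluster of matching records
```

The catch, of course, is that a real system can't demand exact matches across fields this dirty, which is why Google weights the evidence across fields instead. And that, as we'll see, is where the trouble starts.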

But the problem with Google's count, as is clear from the GBS count post itself, is that GBS's metadata collection is riddled with errors of every sort. Or, as linguist and GBS critic Geoff Nunberg put it last year in a blog post, Google's metadata is a "train wreck: a mish-mash wrapped in a muddle wrapped in a mess."

Indeed, a simple Google search for "google books metadata" (sans quotes) will turn up mostly criticisms and caterwauling by dismayed linguists, librarians, and other scholars at the terrible state of Google's metadata. Erroneous dates are pervasive, to the point that you can find many GBS references to historical figures and technologies in books that Google dates to well before the people or technologies existed. The classifications are a mess, too: Nunberg's presentation points out that the first 10 classifications for Walt Whitman's "Leaves of Grass" include Juvenile Nonfiction, Poetry, Fiction, Literary Criticism, Biography & Autobiography, and Counterfeits and Counterfeiting. Then there are authors who are missing or misattributed, and titles that bear no relation to the linked work.

Blaming the libraries

Nunberg is the most prominent of the GBS metadata critics, but many of the digital humanities scholars I've talked with have raised the metadata issue whenever GBS comes up in conversation. The concern appears to be widespread. But is it really Google's fault?

Google actually passes the blame for this situation on to the libraries, pointing out, as it does in the book-counting post, that the company gets its metadata from them. Nunberg responded to this in a presentation last year with, "yes, sometimes... but libraries didn't classify Hamlet as 'antiques and collectibles' or Speculum as 'Health & Fitness'. Libraries don't use BISAC headings like 'Antiques and Collectibles' and 'Health & Fitness' in the first place...And publishers didn't assign BISAC codes to books published before the 1980's."

Contrast this with the view of Eric Hellman, a blogger who covers digital library issues. Hellman agrees with Google that most library metadata collections are in sorry shape to begin with, and he suggests that Google might actually improve the situation if the company can become a one-stop shop for the world's book metadata.

It's also the case that, aside from any library- or Google-induced metadata errors, publishers themselves can be remarkably careless about how they mark different editions of the same work. Editions of important works that can only be told apart by an examination of signature changes in their text are the stuff of bibliophile lore. And how many errors must be corrected and subtle fixes made between printings before a "new printing" gets promoted to a "new edition"? The answer can vary from publisher to publisher and from work to work.

Whoever's to blame for the sorry state of GBS's metadata, no one disputes that the problems are many and endemic. Indeed, much of the Google blog post on the book count is taken up with exactly this issue—i.e., how to deal with the flood of bad, library-generated metadata infesting its records collection. Google's counting algorithm is an attempt to make the best of an awful situation, but Taycher's description of it doesn't inspire confidence in the final output, especially given where GBS's metadata problems seem to be clustered.

Google's process-of-elimination-based counting algorithm assigns different weights to different kinds of metadata, and Taycher indicates that publication dates play an important role in helping Google sort out the mess. But publication dates are the area of Google's metadata collection that scholars find to be the least reliable. Given the pervasiveness of the problems highlighted by Nunberg and others, it's hard to credit any sort of count from Google when a basic piece of information like publication date—a piece of info that's also typically present in the scan itself—is so often wrong.
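
Taycher's post says only that different attributes get different weights in deciding whether two records describe the same book; the fields, weights, and threshold below are invented for illustration, but they show why an unreliable date field poisons the whole comparison.

```python
# Hypothetical weighted-similarity test for whether two metadata records
# describe the same edition. The fields, weights, and threshold are
# invented for illustration; Taycher's post names the approach, not the numbers.

FIELD_WEIGHTS = {
    "title": 0.4,
    "author": 0.3,
    "publisher": 0.1,
    "year": 0.2,   # leaned on heavily in practice -- and often wrong
}

def field_similarity(a, b):
    """Crude stand-in: exact match after normalization, else no credit."""
    if a is None or b is None:
        return 0.0
    return 1.0 if str(a).strip().lower() == str(b).strip().lower() else 0.0

def records_match(rec_a, rec_b, threshold=0.7):
    score = sum(
        weight * field_similarity(rec_a.get(field), rec_b.get(field))
        for field, weight in FIELD_WEIGHTS.items()
    )
    return score >= threshold
```

Under weights like these, a bad year in one record of a pair that should match wipes out a fifth of the available evidence, so real duplicates get counted twice; and two distinct editions that happen to share an erroneous date can be merged into one. Either failure mode distorts the final count.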

Can engineers do art history?

In the end, most of the "metadata problems" that Google's engineers are trying to solve are very, very old. Distinguishing between different editions of a work, dealing with mistitled and misattributed works, and sorting out dates of publication—these are all tasks that have historically been carried out by human historians, codicologists, paleographers, library scientists, museum curators, textual critics, and learned lovers of books and scrolls since the dawn of writing. In trying to count the world's books by identifying which copies of books (or records of books, or copies of records of books, or records of copies of books) signify the "same" printed and bound volume, Google has found itself on the horns of a very ancient dilemma.

Google may not (or, rather, certainly will not) be able to solve this problem to the satisfaction of scholars who have spent their lives wrestling with these very issues in one corner or another of the humanities. But that's fine, because no one outside of Google really expects them to. The best the search giant can do is acknowledge and embrace the fact that it's now the newest, most junior member of an ancient and august guild of humanists, and let its new colleagues participate in the process of fixing and maintaining its metadata archive. After all, why should Google's engineers be attempting to do art history? Why not just focus on giving new tools to actual historians, and let them do their thing? The results of a more open, inclusive metadata curation process might never reveal how many books there really are in the world, but they would do a vastly better job of enabling scholars to work with the library that Google is building.