Google’s count of 130 million books is probably bunk

Google claims to have produced a count of the world's books, but given the …

Google's core Internet search technology famously grew out of a grad school project by Larry Page and Sergey Brin to index the world's books, and the modern Google Books project actually touts itself as the part of Google that carries on the founders' original vision. So, when Google Book Search (GBS), which has thrown high-powered computers, brilliant engineers, and millions of dollars at digitizing the world's books, claims to have come up with a reasonable count of the number of books in the world, who are we to disagree?

"After we exclude serials, we can finally count all the books in the world," wrote Google's Leonid Taycher in a GBS blog post. "There are 129,864,880 of them. At least until Sunday."

It's a large, official-sounding number, and the explanation for how Google arrived at it involves a number of acronyms and terms that will be unfamiliar to most of those who read the post. It's also quite likely to be complete bunk.

The ongoing GBS metadata farce

Google's counting method relies entirely on its enormous metadata collection—almost one billion records—which it winnows down by throwing out duplicates and non-book items like CDs. The result is a book count that's arrived at by a kind of process of elimination. It's not so much that Google starts with a fixed definition of "book" and then combs its records to identify objects with those characteristics; rather, the GBS algorithm seeks to identify everything that is clearly not a book, and to reject all those entries. It also looks for collections of records that all identify the same edition of the same book, but that are, for whatever reason (often a data entry error), listed differently in the different metadata collections that Google subscribes to.
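The elimination-and-clustering logic described above can be sketched in miniature. This is only an illustrative toy, not Google's actual pipeline; the record fields, the rejection rules, and the clustering key are all assumptions made for the example:

```python
# Toy sketch of a process-of-elimination book count: reject obvious
# non-books, then collapse records that appear to describe the same
# edition. All fields and rules here are illustrative assumptions.

def looks_like_non_book(record):
    """Reject anything that is clearly not a book (CDs, maps, serials...)."""
    return record.get("format") in {"audio CD", "microform", "map", "serial"}

def edition_key(record):
    """Crude clustering key: records that normalize to the same key are
    treated as duplicate descriptions of one edition."""
    return (record.get("title", "").strip().lower(),
            record.get("author", "").strip().lower(),
            record.get("year", ""))

def count_books(records):
    editions = set()
    for record in records:
        if not looks_like_non_book(record):
            editions.add(edition_key(record))
    return len(editions)

records = [
    {"title": "Leaves of Grass", "author": "Walt Whitman", "year": "1855", "format": "book"},
    {"title": "Leaves of Grass ", "author": "walt whitman", "year": "1855", "format": "book"},  # data-entry variant
    {"title": "Leaves of Grass", "author": "Walt Whitman", "year": "1855", "format": "audio CD"},
]
print(count_books(records))  # → 1
```

The real difficulty, as the rest of this article argues, is that any such clustering key is only as good as the metadata it normalizes: a wrong year or a garbled title silently creates a phantom "extra" book.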

But the problem with Google's count, as is clear from the GBS count post itself, is that GBS's metadata collection is riddled with errors of every sort. Or, as linguist and GBS critic Geoff Nunberg put it last year in a blog post, Google's metadata is "a train wreck: a mish-mash wrapped in a muddle wrapped in a mess."

Indeed, a simple Google search for "google books metadata" (sans quotes) will turn up mostly criticisms and caterwauling by dismayed linguists, librarians, and other scholars at the terrible state of Google's metadata. Erroneous dates are pervasive, to the point that you can find many GBS references to historical figures and technologies in books that Google dates to well before the people or technologies existed. The classifications are a mess, and Nunberg's presentation points out that the first 10 results for Walt Whitman's "Leaves of Grass" classify it variously as Juvenile Nonfiction, Poetry, Fiction, Literary Criticism, Biography & Autobiography, and Counterfeits and Counterfeiting. Then there are authors that are missing or misattributed, and titles that bear no relation to the linked work.

Blaming the libraries

Nunberg is the most prominent of the GBS metadata critics, but many of the digital humanities scholars that I have talked with have raised the metadata issue any time GBS comes up in conversation. The concern appears to be widespread. But is it really Google's fault?

Google actually passes the blame for this situation on to the libraries by pointing out, as it does in the book counting post, that the company gets its metadata from libraries. Nunberg responded to this in a presentation last year with, "yes, sometimes... but libraries didn't classify Hamlet as 'antiques and collectibles' or Speculum as 'Health & Fitness'. Libraries don't use BISAC headings like 'Antiques and Collectibles' and 'Health & Fitness' in the first place...And publishers didn't assign BISAC codes to books published before the 1980's."

Contrast this with the view of Eric Hellman, a blogger who covers digital library issues. Hellman agrees with Google that most library metadata collections are in sorry shape to begin with, and he suggests that Google might actually improve the situation if the company can become a one-stop shop for the world's book metadata.

It's also the case that, aside from any library- or Google-induced metadata errors, publishers themselves can be remarkably careless about how they mark different editions of the same work. Editions of important works that can only be told apart by an examination of signature changes in their text are the stuff of bibliophile lore. And how many errors must be corrected and subtle fixes made between printings before a "new printing" gets promoted to a "new edition"? The answer can vary from publisher to publisher and from work to work.

Whoever's to blame for the sorry state of GBS's metadata, no one disputes that the problems are many and endemic. Indeed, much of the Google blog post on the book count is taken up with exactly this issue—i.e., how to deal with the flood of bad, library-generated metadata infesting its records collection. Google's counting algorithm is an attempt to make the best of an awful situation, but Taycher's description of it doesn't inspire confidence in the final output, especially given where GBS's metadata problems seem to be clustered.

Google's process-of-elimination-based counting algorithm assigns different weights to different kinds of metadata, and Taycher indicates that publication dates play an important role in helping Google sort out the mess. But publication dates are the area of Google's metadata collection that scholars find to be the least reliable. Given the pervasiveness of the problems highlighted by Nunberg and others, it's hard to credit any sort of count from Google when a basic piece of information like publication date—a piece of info that's also typically present in the scan itself—is so often wrong.
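To make the role of weighting concrete, here is a toy sketch of weighted record matching in which the publication year, as the least reliable field, contributes least to the match score. The field names and weights are illustrative assumptions, not Google's actual values:

```python
# Toy sketch of weighted metadata matching. Field names and weights are
# illustrative assumptions, not Google's actual values.

# Year gets the smallest weight because dates are the least reliable field.
WEIGHTS = {"title": 0.5, "author": 0.375, "year": 0.125}

def normalize(value):
    """Collapse case and whitespace so trivial data-entry variants agree."""
    return " ".join(str(value).lower().split())

def match_score(a, b):
    """Weighted fraction of fields on which two records agree."""
    score = 0.0
    for field, weight in WEIGHTS.items():
        if a.get(field) and normalize(a.get(field)) == normalize(b.get(field, "")):
            score += weight
    return score

r1 = {"title": "Moby-Dick", "author": "Herman Melville", "year": "1851"}
r2 = {"title": "Moby-Dick", "author": "Herman Melville", "year": "1951"}  # typo'd date

# The records still match strongly despite the bad year.
print(match_score(r1, r2))  # → 0.875
```

Down-weighting the year keeps one bad date from splitting a single edition into two "books," but it cuts both ways: it also leaves the date the field least able to veto a false merge.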

Can engineers do art history?

In the end, most of the "metadata problems" that Google's engineers are trying to solve are very, very old. Distinguishing between different editions of a work, dealing with mistitled and misattributed works, and sorting out dates of publication—these are all tasks that have historically been carried out by human historians, codicologists, paleographers, library scientists, museum curators, textual critics, and learned lovers of books and scrolls since the dawn of writing. In trying to count the world's books by identifying which copies of books (or records of books, or copies of records of books, or records of copies of books) signify the "same" printed and bound volume, Google has found itself on the horns of a very ancient dilemma.

Google may not (or, rather, certainly will not) be able to solve this problem to the satisfaction of scholars who have spent their lives wrestling with these very issues in one corner or another of the humanities. But that's fine, because no one outside of Google really expects them to. The best the search giant can do is acknowledge and embrace the fact that it's now the newest, most junior member of an ancient and august guild of humanists, and let its new colleagues participate in the process of fixing and maintaining its metadata archive. After all, why should Google's engineers be attempting to do art history? Why not just focus on giving new tools to actual historians, and let them do their thing? The results of a more open, inclusive metadata curation process might never reveal how many books there really are in the world, but they would do a vastly better job of enabling scholars to work with the library that Google is building.

33 Reader Comments

> Or, as linguist and GBS critic Geoff Nunberg put it last year in a blog post, Google's metadata is "a train wreck: a mish-mash wrapped in a muddle wrapped in a mess."

Bah, his complaint was precisely the sort of out-of-date elitist argument people make for keeping print newspapers. His primary complaint is that the "info" section is inaccurate, which it is, but that's only an argument if anyone actually uses it. I don't, and GBS is my primary source of information of all sorts. Fixing this is polishing the brass on the Titanic.

I trust Google's count many, many times more than I trust old-tech systems. At least Google can directly compare text, for instance. THAT will let you know if two books are the same, and the only reason we didn't do this before is because we couldn't.
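For what it's worth, the direct text comparison this commenter describes is a real technique. A minimal sketch using word shingles and Jaccard similarity (the shingle size and threshold here are arbitrary choices, and production systems use sturdier variants such as minhashing):

```python
# Sketch: judging whether two scanned texts are "the same book" by direct
# text comparison, using word shingles and Jaccard similarity.
# Shingle size and threshold are illustrative choices.

def shingles(text, k=3):
    """All k-word sequences in the text, case-folded."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Overlap of two shingle sets: 1.0 means identical, 0.0 means disjoint."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

scan1 = "I celebrate myself and sing myself and what I assume you shall assume"
scan2 = "I celebrate myself and sing myself and what I assume you shall assume"
scan3 = "Call me Ishmael some years ago never mind how long precisely"

print(jaccard(scan1, scan2) > 0.9)  # → True: same text
print(jaccard(scan1, scan3) > 0.9)  # → False: different works
```

Of course, this only tells you two scans contain the same words; it can't by itself distinguish a "new printing" from a "new edition," which is exactly the bibliographic problem the article raises.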

Maury

It's hardly elitist to request accurate metadata, such as publishing date and author, and genre categorization which is not completely misleading and/or geared toward B&N stores. While massively erroneous data may not be problematic for identifying and working with a single text, it becomes extremely problematic if one wants to do mass computation or analyze the body of work.

Thanks Stokes for drawing some more attention to this problem. Your conclusion sounds reasonable too -- I think most humanists would be really happy to get their hands on some nice tools to properly describe these texts!

There are really two issues here. The first is the horrible handling and parsing of data within this project. The second is that a lot of librarians, historians, and other textually focused anthropology folks know a good computerized solution means the marginalization and obsolescence of their current field of study and sector of employment (most likely it will become either much easier or much harder, depending on whether you choose to focus on known works or continue to try to find non-indexed texts). They will (and probably should) rail against every error no matter how small. This will be a good thing. The aim of some may be swaying public trust away from technology solutions to the field (self-preservation, really), but their effect will be continued refinement of the tool. Interaction with literary historians will hopefully end up being a litmus test for any solution Google wants to bring to the table. Please the more honest ones and you know you're doing well.

One has to ask what's the point of making a worldwide and historical book count. That's a pipe dream, if there ever was one. It's like counting hairs on someone's head by looking at each one individually and claiming, after the first count, that that's the definitive number. For instance, Google's count is wrong. By at least 5. There's no reference to:

- All four editions of "Este Livro que vos Deixo" by Vitalino Martins Aleixo, son of the poet, printed by Tipografia Guerra and published by Loulé. The last edition is dated 1977. (Poetry)

- 1988 edition of "Luz e sombras de uma vida" by Regina Madureira, published by the author herself (Romance)

And these are just the first two searches I did, based on my own personal library, picking books I figured would be hard for Google to have listed.

Sorry, Google. Nice project. But tone down the marketing-oriented claims, please. I'd be happier that way, because it would clearly indicate a genuine and unadorned cultural interest.

Another problem with this is: what about rare books? Books that aren't in a library, haven't been in print for decades if not centuries, and might only exist in a few collectors' libraries?

And, given the notoriously bad handling of Asian languages by Google, are they counting all 5000+ years of Chinese literary history? Works from ancient Japan? What about Egyptian scrolls?

They don't make mention of this in the blog post, but are they claiming this is an "ever" figure, or simply the number of "books" currently extant? And, if the latter, is it just those in print, or does it include those rare out of print books that might only have 3 copies left in the world?

This whole post is just horn-blowing, and it's obnoxious as hell. It's definitely a GIGO situation, and it annoys me that Google is playing on its reputation to make this claim, and most of the population will accept it as fact.

Who knew it took 14 paragraphs to say "They have bad metadata. The number, while probably within 10%, is wrong, and they shouldn't have quoted so many digits to make it look more accurate than it is."

I think it's awesome that you could write a post this long about book metadata and not once mention the largest supplier of book metadata in the world -- OCLC. Go look at the WorldCat.org database, which is set to pass 200,000,000 records any day now, of which about 85% are for books. OCLC is one of the suppliers of book metadata to Google. Oh, and by the way it's _Eric_ Hellman. So much for getting YOUR metadata right.

What amazes me is how offended people are on this topic, like how dare Google even attempt to answer such a question. Well, Google has made a habit of answering a great many questions that people previously thought impossible to answer.

I read the entire Google blog post where they explain their methodology, because I too thought that making such a claim was pretty bold. I'm not an expert, but I was satisfied by the approach they took despite the nature of library metadata, which anyone who witnessed the Dewey Decimal process at their school library growing up would know has to be a shambles.

The examples Mr. Stokes cites as to why the count is "probably bunk" seem like the anecdotal, one-off criticisms of someone not willing to tackle this messy problem themselves. There is no solid data to back these counterclaims, only assertions that Google must have done something or other incorrectly, and shame on them for spending time on this and not providing "new tools to actual historians."

This is the best guess Google has given the current data, which they have only had to wrestle with as a result of the book scanning project. Is their number off by 10%? Most certainly. Is it off by 100%? Not likely. In my mind it gives a good lower bound, perhaps, on the number of books, and the mere attempt at estimating the number seems like a noble goal to me, one which librarians should be very interested in knowing, unless of course they are innumerate Luddites, which may well be the case.

We have a count and a mostly working system, and I assume a plan for an iterative process to make things better.

It's just like when the old-school music pirates had to go back and fix the ID3 information for rips made before CDDB. It may suck now, but eventually it will suck less, and at some point, while not perfect, it will silently move from sucking to indispensable.

It's hardly elitist to request accurate metadata, such as publishing date and author

Do you expect accurate data on web sites? You might _want_ it, but do you find it at all accurate? So what you do is ignore it, and use completely different methods to characterize the web. You use Google.

walker222 wrote:

it becomes extremely problematic if one wants to do mass computation or analyze the body of work

Only if you think you should analyze it using metadata. That's soooo 20th century!

Index everything, run statistics. Faster, provides any answer you want, and is almost certainly more accurate.

BTW it's not elitist to request metadata, it's elitist to suggest that Google Books should have it. It's not what Google Books is, or wants to be, nor should it aspire to it. If metadata crawling of dead trees is important, use the systems that already exist for doing that. It's not Google's fault they all stink.

I'm with Geg and a few others. Google has set out on a nigh-impossible task. Criticism is good to help them, but having a catalog of every book ever (or at least as many as we can find) would be fantastic. I spent hours and hours in college just trying to find books with the various sources available now, at one of the largest libraries in the country, and it was a frustrating process. Anything to help is a step forward.

As a side note, I know how hard it is to organize 60 gigs of music across 3 computers. 200+ million books? Godspeed.

One has to ask what's the point of making a worldwide and historical book count. That's a pipe dream, if there ever was one.

So is making a worldwide population count. But we do that all the time.

Quote:

For instance, Google's count is wrong. By at least 5. There's no reference to:

There's no reference to the person who was just born, but that doesn't stop one from estimating the number of people in the world down to the individual. They clearly stated how they take these issues into account.

Even if the number is wrong, it is an interesting problem to tackle that will expose data issues in Google Books and, judging by some of the comments, libraries at large. This is precisely why it sounds like something worth tackling.

If metadata is the problem, and the errors are glaring, then this seems like something that could be improved with a crowdsourcing approach. The correctness of the metadata would have Wikipedia-like characteristics, but that's not all bad. In my experience, there are a lot more people with a dedicated interest in organizing their book collections than there are in writing encyclopedia articles, so participation rates could be fairly high.

I don't think art historians or book scholars need to worry about their jobs becoming obsolete any time soon. One thing the Google count shows is that there is plenty of work to be done and it will take a very long time and quite a few people to do it.

There's no reference to the person who was just born, but that doesn't stop one from estimating the number of people in the world down to the individual. They clearly stated how they take these issues into account.

If you want to believe Google found the total number of books in the world within a certain margin of error, I won't try to dissuade you. I have better things to do.

Very few things surprise me anymore. Gullibility is the 21st-century consumer fad.

Why in the world do you think appropriate metadata isn't important for common usage? If you want Google, with its reputation for useful information, to publish an index of books they claim to be reasonably accurate, it's only a matter of time before the average college student accidentally Wikipedias the erroneous information and sticks it in his essays, and...

It's not elitist for academia to request that high-use sites of increasingly relevant importance contain accurate information for their purposes. It's ludicrous to suggest that a very prominent company could index the world's collection of books without changing anthropological studies, which makes it important to the historian to push the collection to be accurate. I for one wouldn't want a student writing an essay on Leaves of Grass believing it was a children's novel. That's a travesty worth essay-tearing, and you yourself said you trust digital sources more than their paper companions. (Although WorldCat is not very paper at all... that doesn't matter, because WorldCat is not the first Google result.)

Edit: oh yeah, and on the subject of being able to compare text with enough specificity to tell editions apart: to be of any use to the academic community for historical analysis of works, you're going to need OCR technology that's about 1000 years ahead of the shite Google is using (or any existent program, for that matter). For example, look at Gutenberg. Distributed Proofreaders has people proofread each OCR'd novel, I think, three times before sending it to Gutenberg, because OCR is so error-prone.

Indexing works like you say would take a hell of a lot more work than creating a perfect metadata library, to get the accuracy needed for anthropological and literary studies. It is for professional, academic reasons (or for those who are OCD about their collections) that the info tab is important, not for the layman.

Only if you think you should analyze it using metadata. That's soooo 20th century!

Index everything, run statistics. Faster, provides any answer you want, and is almost certainly more accurate.

BTW it's not elitist to request metadata, it's elitist to suggest that Google Books should have it. It's not what Google Books is, or wants to be, nor should it aspire to it. If metadata crawling of dead trees is important, use the systems that already exist for doing that. It's not Google's fault they all stink.

I don't have a problem with Google, and, copyright issues aside, they are performing an admirable job with this. And the metadata will get better with time.

But there's definitely room for improvement. Automated systems can handle more volume but are less reliable than humans. If you read the Nunberg post, he's pointing out the dangers of blindly indexing everything. If you're searching for books published in 1875, do you want data based on bookplates and in-book advertising? But since everything is OCR'd, Google could work with librarians to iteratively devise better algorithms and regenerate metadata more accurately.

If I understand the article and the Google Books blog correctly, the huge collection of metadata input is what Google has collected from as many metadata sources as possible. It seems to me like the people whining are the ones responsible for creating the huge pile of metadata garbage to begin with. (Or rather, their predecessors.)

Perhaps it's time to mix a bit of engineering into the process and see what happens? Obviously unaided "crowd sourcing" isn't the solution to the problem; that's what we've been trying for 7000 years, and it caused this problem in the first place.

I also don't get what is up with the antagonistic tone in this article. Reading it, it seems like these historians feel that GBS is Satan, come to claim the souls of all of humanity. Surely that can't be the case?

romnempire wrote:

Why in the world do you think appropriate metadata isn't important for common usage? If you want Google, with its reputation for useful information, to publish an index of books they claim to be reasonably accurate, it's only a matter of time before the average college student accidentally Wikipedias the erroneous information and sticks it in his essays, and...

College students do stupid shit all the time. Sometimes they get flunked for it. Reading a book and not being able to determine whether it's for children or adults cannot be the most important problem here.

But metadata is important, and that's why the Google Books blog finished off by saying that it's a work in progress.

romnempire wrote:

Edit: oh yeah, and on the subject of being able to compare text with enough specificity to tell editions apart: to be of any use to the academic community for historical analysis of works, you're going to need OCR technology that's about 1000 years ahead of the shite Google is using.

According to Google's predictions, their word error rate is 1-10%. That's not perfect, but it's something that can be improved. And once it's improved, running the new algorithms is a lot cheaper, since you don't have to re-scan all the books.

If I understand the article and the Google Books blog correctly, the huge collection of metadata input is what Google has collected from as many metadata sources as possible. It seems to me like the people whining are the ones responsible for creating the huge pile of metadata garbage to begin with. (Or rather, their predecessors.)

Perhaps it's time to mix a bit of engineering into the process and see what happens? Obviously unaided "crowd sourcing" isn't the solution to the problem; that's what we've been trying for 7000 years, and it caused this problem in the first place.

I also don't get what is up with the antagonistic tone in this article. Reading it, it seems like these historians feel that GBS is Satan, come to claim the souls of all of humanity. Surely that can't be the case?

In contrast to many here, I found nothing at all antagonistic about the tone of this article. It pointed out repeatedly that Google acknowledges every problem with metadata and with their count that critics have raised, for example, and that seemed to me to be giving Google as much benefit of the doubt as possible (i.e., this article isn't saying 'Google's full of morons because they don't even realize that they're dealing with problem XYZ, which is an obvious one').

The only 'antagonistic' thing I saw was the suggestion that the tone of Google's post, and perhaps the thrust of their project, should be more along the lines of 'here's what we've managed to do with the data-set we've got, and here's how we did it, and that means that other people who want to do similar things can do XYZ to achieve ABC...'. That is, Jon Stokes seems to feel that Google would be better off acknowledging the problems present, and discussing and acting on methods and technologies for approaching the problems, rather than declaring that they've got anything like a 'final result' (which is what the count of books sounds like it's trying to be, even though everyone knows that it's not, and that Google's staff would be foolish to think that it is).

Hast wrote:

romnempire wrote:

Why in the world do you think appropriate metadata isn't important for common usage? If you want Google, with its reputation for useful information, to publish an index of books they claim to be reasonably accurate, it's only a matter of time before the average college student accidentally Wikipedias the erroneous information and sticks it in his essays, and...

College students do stupid shit all the time. Sometimes they get flunked for it. Reading a book and not being able to determine whether it's for children or adults cannot be the most important problem here.

Metadata problems can be very important ones. Mismatched or incorrect metadata can be the difference between an instructor deciding that a student has simply made up citations and deciding that the student has actually read and correctly cited references. In the classes I taught (at a large university in the USA), the former is at best a grade of 0 on the project, and potentially more significant depending on circumstances. Taking into account the possibility of student error is something every instructor needs to do. Now they all need to take into account the very real possibility that the student didn't make the error, but is only passing on an error from somewhere else (or, they could just be making stuff up, still).

In short, there are at least two significant consequences that I can think of off the top of my head:

1. Mistakes can cost students substantial portions of their grade in a class, and not because they or their instructors got anything wrong.

2. The presence of error-ridden metadata, combined with tools for indiscriminately accessing it, makes the job of research and grading harder for everyone who interacts with that data, and I think that we can agree that making it harder to complete a research paper, or to grade 25-30 of them (most of my classes had about that many students), is not a trivial problem.

Hast wrote:

romnempire wrote:

Edit: oh yeah, and on the subject of being able to compare text with enough specificity to tell editions apart: to be of any use to the academic community for historical analysis of works, you're going to need OCR technology that's about 1000 years ahead of the shite Google is using.

According to Google's predictions, their word error rate is 1-10%. That's not perfect, but it's something that can be improved. And once it's improved, running the new algorithms is a lot cheaper, since you don't have to re-scan all the books.

fwiw, I think that a 1% error rate in OCR is a serious problem for most academic researchers, based on my experience working with handwritten manuscripts and scanned books and papers. 10% would have me looking very, very hard for an alternate source of the information, or simply re-doing the transcription myself. Better OCR is welcome (and expected) but I haven't heard of anything that's likely to change error rates by an order of magnitude, and anything less than that still leaves us with significant problems for academic researchers.

All IMO, of course, and there is obviously a lot of room for legitimate differences of opinion on many aspects of this issue.
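The word error rate figure being debated in these last comments is conventionally computed as the word-level edit distance between a reference transcription and the OCR output, divided by the length of the reference. A minimal sketch (the sample strings are made up for illustration):

```python
# Sketch: computing OCR word error rate (WER) as word-level edit
# distance from a reference transcription, divided by reference length.
# The sample strings below are made up for illustration.

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

reference  = "it was the best of times it was the worst of times"
ocr_output = "it was the besl of times it was the worst of tirnes"
print(round(word_error_rate(reference, ocr_output), 3))  # → 0.167 (2 errors / 12 words)
```

At the 1% end of the range quoted above, that's roughly one wrong word per few sentences; at 10%, one per line, which is why researchers working from transcriptions find the upper end unusable.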