DIGITIZATION AND ITS DISCONTENTS.

That’s the subtitle of an article by Anthony Grafton in the latest New Yorker on the business of putting books and other written material online. Grafton begins with the wonderful writer Alfred Kazin and his encomium to the New York Public Library (“Anything I had heard of and wanted to see, the blessed place owned…”) and goes on to Google Books and its mission to “build a comprehensive index of all the books in the world” and the wild-eyed “millenarian prophecies” it has spawned (“Last year, Kevin Kelly… predicted, in a piece in the Times, that ‘all the books in the world’ would ‘become a single liquid fabric of interconnected words and ideas’”), continuing with a compendious history of libraries, cross-referencing, and abridgment that includes such tidbits as Jacques Cujas’s “rotating barber’s chair and movable bookstand that enabled him to keep many open books in view at the same time.” Then he gets back to Google, Microsoft, and other players in the digitization game, and explains why “The supposed universal library… will be not a seamless mass of books, easily linked and studied together, but a patchwork of interfaces and databases, some open to anyone with a computer and WiFi, others closed to those without access or money.” My main complaint is that the piece isn’t longer; at four pages, it’s a mere appetizer. Fortunately, the online version of the magazine has a sidebar that “points to some favorite archives and historical resources,” and among the links to be found there is one that addresses my subsidiary complaint, that he doesn’t complain enough about the wretched failures of Google Books: Robert B. Townsend’s AHA blog post “Google Books: What’s Not to Like?” Townsend lets them have it:

Over the past three months I spent a fair amount of time on the site as part of a research project on the early history of the profession, and from a researcher’s point of view I have to say the results were deeply disconcerting. Yes, the site offers up a number of hard-to-find works from the early 20th century with instant access to the text. And yes, for some books it offers a useful keyword search function for finding a reference that might not be in the index. But my experience suggests the project is falling far short of its central promise of exposing the literature of the world, and is instead piling mistake upon mistake with little evidence of basic quality control.

He details all these failings, except for my particular bugbear, which is addressed in the first comment on his post:

You didn’t mention my pet peeve. In my work, I need to basically fact-check some historical info. The snippet view for copyrighted works would be, if not ideal, then sufficient for my objectives. That is, if the snippet actually included the search terms requested with a little surrounding text. However, more often than not some text other than what one asked for is highlighted, but one can’t, of course, scroll up or down in the snippet to see adjacent passages. So one is left wondering: now what? This is now more than just incidental. I’ve reported it to Google and they respond that it’s still beta so be patient.
— Jim Roan

You tell ’em, Jim! That “snippet view,” more than any other single thing, makes me dislike Google, and for years I never thought I would have any reason to dislike Google. Shape up, guys—make sure your snippets at least include the searched-for material and provide full view for the out-of-copyright stuff, and we’ll love you with that unreserved love you’ve gotten used to; keep pushing these defective goods and someone else will come along and do it better.

(Thanks for the links, Paul!)

Comments

Patience is indeed the correct advice. They (or we, but I have nothing to do with Google Books) must tread very lightly indeed to avoid being bitten by the copyright monopolists. And the problem’s much harder than it appears to be.

I resisted contriving an opportunity to produce the usable line Umpires of the Word, when we had discussion of Ostler’s book Empires of the Word, with discussions of contested usages nearby. But I cannot now resist offering the line Civilisation and its Disk Contents, which I’m sure several of us have coined independently. All I can cling to is this, in respect of my own claim to priority: Google only has it spelt with a z.
The issue, in any case, is serious. Everything is changing so fast: so fast that even to remark on this fact is long passé. We have this “snippet view” of just about everything, and there’s no turning back to those antique ways of apprehending language, music, and so much more that many of us LH regulars remember with hopeless nostalgia. I don’t say it’s all a bad thing, of course! Look at us here, with the pages of the world seeming to be laid out before us for perusal, summoning forth our wit and erudition – and yes, our shared delight. But the morbid thought that lurks in wait is this: we can never go back. Our books will never again be what they were for us. As for the online world’s luminous promise of some hazy-edged replacement, I am reminded of Leonard Cohen’s great lines – one year after Apollo 11:
Ah, they’ll never, they’ll never ever reach the moon,
at least not the one that we’re after.
[“Sing Another Song Boys”, Songs of Love and Hate, 1970]

I have to confess I don’t grasp the idea of Google Books at all. What is it for? Just to find the source of a quotation? I can easily do it using just Google web search.
I spent some years working at a local library of British Council as an IT guy, but the experience of the front office work gives me reasons to secretly flatter myself by thinking I am a librarian, in some way. Google was my best friend at the reference desk, but I can’t recall a single occasion when I would miss Google Books…

Google Books is great; I just downloaded a complete key source (from 1724) for my PhD thesis. It looked as though Grafton was going to start blogging for a while, but he never did. Anyway, LH, his History of the Footnote is recommended, if you haven’t read it; at least, the first half is.

Just as a small example, Google Books provides ready and immediate access to what Grafton calls the longest footnote ever. I’m pretty sure some local libraries have this work, but I might not have bothered just for a few minutes of delight reading it. This is at least as significant as being able to access something so rare that it would have been impossible to obtain physically.
In an entirely different vein, take BibliOdyssey, a site of found images from book and manuscript digitization efforts around the world. No single set would be impossible without online access, but each would be such an undertaking that no such assemblage could result.
If all these efforts together realized their goals, high quality scans with accurate OCR of every book ever produced, with full access for out of copyright and bit-by-bit (liberal automated fair use) for newer, I’d have no complaints. As others have said, the risk is that a crappy job of it is being done now and no one will ever have the incentive to go back and redo it. Ever.

Digital libraries will change the world, of course. But it is important to remember that we are presently in the early stages. One upside prediction that can safely be made is that search engines will eventually be able to search even facsimile/scanned texts. This will do away with the clumsy keyword system associated with such texts at present.
On the downside, such large (in this case huge) Internet projects are driven by the profit motive. Our free e-mail provider, for another example, requires us to “click-off” on the conditions stated in a “Terms of Service” agreement. Among the items that we must agree to is that the “free” status of our service can be changed at any time upon notice. The providers do not all include this for no reason. We are being trained to depend upon e-mail, and once the training has reached the point at which the world feels the need of the service, and the vast amount of available server space begins to be strained, or is needed to provide other popular new free services for training, we will find ourselves paying for e-mail. What a service-provider offers for free today it expects to turn a profit on in the future.
We are fortunate to live at the beginning of the net in one way. At this stage, it is in the immediate interest of Internet companies to offer a considerable portion of their virtual product-line for free. Even during the past five years, however, the percentage of data bases that require payment for access has been growing. Should that payment not be direct it must trickle down to each individual user after one fashion or another, usually through a primary customer which must somehow pass its cost along and make a percentage to boot (regi$tration at a college or fee $chedules).
For the time being, though, a huge number of books have gratifyingly been added to our libraries for free. A mere twenty years ago, the resources we have available to us now could not have been imagined. We’re in Wonderland. In another twenty, they will likely be unimaginable again.
Of course, if we fail to convert from an oil-based economy to a diversified energy plan, over the same twenty years, I wouldn’t sweat the data base fees.

As Roan noted, it’s still in Beta. Give it 10 minutes or so. GB’s various little problems will get worked out; I’m sure there’s a continuous improvement process in place. The snippet is there because publishers have imposed copyright restrictions, just as the music industry did, initially, against Napster and the like. But now you can download music essentially by the snippet, legally, for a small fee per snippet. Sooner or later, Google Books will offer the same thing, I predict: you’ll get the snippet view, and then have the opportunity to click through to a page or multiple pages in exchange for a micropayment measured in pennies (which will probably vary depending on the book), or a subscription fee (which would vary depending on the volume of your usage). These micropayments would represent a new and incremental revenue stream to publishers and authors. Researchers who don’t wish to pay would still use GB for free to locate sources, and could then go to the library to look up the book.

Having read this post, I tried to find an arbitrary text on Google Books. I closed my eyes, picked a random book from the bookshelf and it was Tartarin of Tarascon by Alphonse Daudet. GB shows 683 hits. I checked some links, but none of them gave me the full text of this doubtlessly out-of-copyright work. Google web search gave me a link to Wikipedia on the first page, where I found a reference to the Gutenberg project. There it is.
I know I just don’t understand what copyright is. But how can a book written in 1872 be marked as “copyrighted material” just because it’s scanned from a later edition? Makes no sense to me.

If you want Full view, you must click that in your search. GB has about a dozen full text downloadable and searchable copies of Tartarin of Tarascon by Alphonse Daudet. Here are the easy to find ones: two in English, four in French; there are more where the title is Works of and so on, if a particular edition / translator is needed.
The business of scanning a recent PoD facsimile of an out-of-copyright work and thereby freezing access is indeed a big problem. As is just getting the date wrong.

> Here are the easy to find ones
“Your search – intitle:Tartarin intitle:Tarascon inauthor:Alphonse inauthor:Daudet – did not match any complete books.”
Hmm… It must be the retribution for living in Russia, where the books are available online no matter if they are copyrighted or not 🙂

Google Books CAN be fantastic – I used it yesterday on a project to find out when and how roast barley began to replace roast malt as a beer flavouring/colouring (hey – that’s why my pseudonym is what it is, OK?) and pulled up references in five minutes or less sitting at my desk that I wouldn’t have found without at least a day travelling in to the British Library – if then. Through Google Books I’ve found gems I never knew existed – Joseph Banks recording in his Endeavour journal eating Cheshire cheese and drinking porter two years out from London in the middle of the Pacific, for example.
BUT – I endorse all the comments that have already been made about its frustrations. If I put in “author Stopes title malt and malting” it tells me no preview of the book is available, though it was published in 1885. If I put in text from the copy of the book behind me on the shelf, Google Books will then deign to provide a “snippet view” that, yes, doesn’t actually centre on (or even properly show) the words I entered – and why only a snippet?
As a regular user of Google Books for research, I find many of the “it’s out of copyright, why can’t I have a full view” occasions seem to be associated with particular suppliers of books to Google for scanning – University of Michigan being one, Oxford University being another.
The unofficial Google Books users’ group appears to have just two members – maybe it’s time for some more to join up … as a project it does indeed promise so much, and it occasionally delivers treasure, but it could be so much better …

Two members and “no topics.” Too bad; I’d love to join a functioning community where I could bitch and discuss. (And needless to say I share your mixed feelings; if it weren’t so fantastic I wouldn’t be so enraged by its failings!)

The Internet will do much to redress this imbalance, by providing Western books for non-Western readers. What it will do for non-Western books is less clear.

So now I’m wondering about that second part in particular.
Suppose we ignore manuscripts and archives and limit ourselves to books produced by some means that lends itself to distribution. What fraction of the books in the world are Western?
Further suppose we ignore local digitization efforts, or define Western precisely to mean those places where there are significant digitization efforts. What fraction of the non-Western books have at least one copy in a Western library?
I honestly don’t have any intuition. I’m not even sure where to look. I can find various facts and figures, but nothing that takes this particular perspective.

Google Books and similar efforts have digitized maybe a couple million books.

OCLC says its member libraries have 16 billion volumes. But it doesn’t seem to say anything about distinct books.

Bowker GBIP has around 15 million books, but that’s mostly in English and Spanish, right?



