HP, UMich deal means a “real” future for scanned books

HP and the University of Michigan have inked a deal that will see HP …

In a retro twist on the Google Books idea, HP has announced a partnership with the University of Michigan library to sell physical copies of over 500,000 rare and out-of-print works, while making the digital versions available online for free.

HP's BookPrep service, currently in beta, will take in raw scans of books, clean them up to prepare them for re-printing, and then offer print-on-demand copies for sale via normal online book distribution channels like Amazon. This new arrangement mixes a number of aspects of existing efforts like Google Books and current print-on-demand (PoD) offerings, while being a little different from either, and in the process it points the way to a real future for the digital contents of libraries' special collections.

All scanned in and no place to go

The first way in which the HP/Michigan deal differs from Google Books is that HP itself is not doing the scanning. Instead, HP is taking advantage of the rare book scanning efforts that are already underway at Michigan—HP just takes Michigan's raw scans and turns them back into books. This basic idea has much wider applicability than just at Michigan, since libraries across the country are currently in the process of digitizing their special collections.

When I was at the University of Chicago, I seriously explored the idea of doing thesis work in the digital humanities. During that time, I learned that most special collections departments at libraries and museums are engaged in some type of high-quality digitization effort for rare documents: books, scrolls, photographs, and other printed and handwritten matter. These projects generate huge amounts of high-quality image data, but there's currently no way for most of these collections to make that data available to the public. So that data sits unviewed in an archive somewhere, just like the special collection items it represents.

Google, Microsoft, Amazon—these companies should actually start ingesting that data and hosting it, but there are a number of reasons why this doesn't appear to be happening (that's another article, though). In the meantime, a PoD effort like the HP/Michigan collaboration is a good way to make some of this material available in a convenient format that doesn't involve designing a clunky Web-based interface for it.

The PoD aspect of the HP/Michigan effort isn't just about making books available in a convenient, universally accessible format—it's also part of the printer maker's ongoing attempt to keep people printing in the face of the nascent e-paper and e-book revolution.

"People around the world still value reading books in print," said Andrew Bolwell, HP's director of New Business Initiatives, in a press release. HP clearly hopes that this statement will continue to hold true for some time to come.

Not ordinary PoD

HP's BookPrep is by no means the only PoD service in the world, nor is HP the only on-demand printer. PoD services like Lulu.com and Apple's iPhoto books have deals with on-demand printers that do the actual printing, binding, and shipping for them, and most large printing houses, like R.R. Donnelley, have print-on-demand services in addition to their traditional presses.

What separates BookPrep from the rest is that normal PoD shops take in only print-ready digital files, usually PDFs. BookPrep, in contrast, will take high-resolution scans that aren't fit to print, and automatically clean them up for printing. Take a look at the examples below from HP's BookPrep website, where the original scan is on top and the print-ready copy is below it.

This presentation problem is currently the number one barrier to getting most of the aforementioned special collections' material on the Web, even if the institutions that produced the scans could afford to host them (which they can't).

Making an interface that lets you usefully interact with high-resolution scans of papyri, books, handwritten notes, photographs, and the like is a massive undertaking, and there currently exists no off-the-shelf package designed specifically for this purpose.

The University of Chicago, for instance, uses software that was originally designed for medical images to present its "Archaic Mark" manuscript on the Web, but the experience isn't exactly on par with something like Google Maps.

A PoD effort like the one HP has announced could be a less painful method for getting special collections material out to the public, at least until either Google or Microsoft realizes that it should take this data and adapt online map services to display it. (There are a ton of similarities between book scans and map data, not the least of which is that both involve 3D datasets that are projected into 2D for Web use; yup, most book scanning is now 3D.)

Hopefully, HP will announce more such deals in the near future, because there are plenty more institutions that would love to take the terabytes of raw, high-resolution scans that are sitting on dusty hard drives and make them available to the viewing public.

14 Reader Comments

Someday, everything will be digitized. My dad spent 2 years buried in the stacks at the Bodleian Library at Oxford dusting off 16th Century burial records as part of his PhD research. His information will be the last to be scanned, but it will happen eventually.

Meanwhile, all the really cool and colorful manuscripts from Ye Olde World are going to be online.

I like the concept of digitizing as a way of saving space and getting the material out into the world more easily. The fact that they take it a step further to also sell the books to the public could mean even more efficiency, by generating some revenue for the institution and reducing the costs of storage.

I also like the idea of people, not institutions, owning pieces of history. Individual collectors can buy books sentimental to them (like Frisbee's burial records), investors can collect them as assets or they can be lent to museums for more exhibits displaying never-made-public texts now that they have been released from the protective clutches of institutions.

So, the material is getting spread around in more than one way and I think that will benefit all of us a lot more.

Hopefully this will improve end-user OCR applications. Currently available (half-decent) OCR solutions continue to be overpriced for non-business use and are far too time-intensive when the source is not pristine.

These are called facsimile editions, and have been around for some time. A company called Scolar Press does nothing but them. Pickering and Chatto do a lot.

The innovation here is 'on-demand'. I'm keen to see what the quality is like, because the book that I saw from the much-hyped Espresso Book Machine (not a facsimile, but a resetting) didn't look like it would last long, and was horrible in other ways.

If the output is durable, it solves one problem with digitization of historic books - guaranteeing the longevity of the output in a situation where conservation of originals is also making demands on library budgets.

Originally posted by FrisbeeFreek: My dad spent 2 years buried in the stacks at the Bodleian Library at Oxford dusting off 16th Century burial records as part of his PhD research. His information will be the last to be scanned, but it will happen eventually.

Sounds like a great topic! New doctoral theses at Oxford and most other places are, I believe, digitized and archived as a matter of course. Digital or paper copies of old ones can be ordered at fairly low cost.

quote:

Meanwhile, all the really cool and colorful manuscripts from Ye Olde World are going to be online.

You would not believe how many manuscripts in major libraries (including some very famous libraries in the New World) are not even recently catalogued, let alone imaged!

I'm stunned by how much better and cheaper digitization has become in the last couple of years, but some things can't be speeded up. You need to do a proper conservation analysis before imaging a book, and you ought to check the cataloguing. Print-on-demand could help to fund that work.

Here's a list of Oxford digitization projects, but my favourite early manuscript resource is Scriptorium, based at Cambridge.

they can be lent to museums for more exhibits displaying never-made-public texts now that they have been released from the protective clutches of institutions.

Not sure what you mean by 'never-made-public'? Almost all libraries will let any member of the public who can argue that they have a reason to see an original, rare book do so. For free, no less. Some libraries will let anyone with ID walk in off the street to consult unique and precious materials, and increasingly they will let them take pictures. Again, for free, as it should be.

For those who are merely curious, and just want to see the beauty of the book (totally legitimate reasons), most major libraries do indeed contribute to exhibitions, and often have exhibition spaces of their own. Of course, every book on display in a museum exhibit is unavailable for a reader to consult, so a balance needs to be struck there.

You are right to describe these institutions as 'protective', which is why we still have a certain fraction of the world's written heritage to learn from and to enjoy. 'Clutches' is a little unfair, though. The basic contradiction of the mission of the library is this: the library must keep books safe, and make them available. These are always in a precarious balance, and digitization/print-on-demand doesn't abolish it, although (like you say) it may allow new balances to be struck, which is why I share your enthusiasm for this technology.

I hope this means there will never be an information loss as devastating as the fire at the Library of Alexandria. Forget about the baby jebus, it would make me cry. My tag line on the newsgroups used to be "The Library of Alexandria should have had a mirror site."

Someday, everything will be digitized. My dad spent 2 years buried in the stacks at the Bodleian Library at Oxford dusting off 16th Century burial records as part of his PhD research. His information will be the last to be scanned, but it will happen eventually.

Meanwhile, all the really cool and colorful manuscripts from Ye Olde World are going to be online.

Ah, the Bodleian. Where items are organized by date of entry.

Those manuscripts are going to look cool, but no reproduction can match the brilliance of the originals. Gold leaf printers might help!

OCR is needed to search and cross-index source material. A pure graphic scan (of text) is much less useful.

Isn't the biggest challenge the digital format and storage medium?

Something stored on state-of-the-art 8" floppies back when is virtually lost today. 5.25" floppies are almost lost (none of my new PCs have those). Ordinary CDs and DVDs will be lost in the near future, just as soon as some next-gen multi-layer Blu-Ray format takes off in the consumer market with drives that are not backward compatible.

At the moment JPEG and TIFF are common graphic formats (hah, with multiple versions even today) but they could, someday, become an inscrutable binary format.

Originally posted by Carrie: I hope this means there will never be an information loss as devastating as the fire at the Library of Alexandria. Forget about the baby jebus, it would make me cry. My tag line on the newsgroups used to be "The Library of Alexandria should have had a mirror site."

I was only half-listening when I heard it, but if I recall right, this radio show cast some doubt on the question as to whether there was one single, catastrophic fire, or more of a slow process of neglect, which was just as devastating in the long run.