Archive for August, 2012

This is a list of serial publications (journals, yearbooks, magazines, newsletters, etc.) whose editorial board includes at least one person from the University of Notre Dame. This is not a complete list, and if you know of other titles, then please drop me a line:

This blog posting simply points to a browsable and downloadable set of MARC records describing a set of books held by both the HathiTrust and the Hesburgh Libraries at the University of Notre Dame.

were denoted as a part of the Hesburgh Libraries at the University of Notre Dame

were denoted as a part of the HathiTrust

had a one-to-one correspondence between OCLC number and digitized item

This list of MARC records is not, and was never intended to be, a comprehensive list of overlapping materials between the Hesburgh Libraries collection and the HathiTrust. Instead, this list is intended to be a set of unambiguous sample data allowing us to import and assimilate HathiTrust records into our library catalog and/or “discovery system” on an experimental basis.

The browsable interface is rudimentary. Simply point your browser to the interface and a list of ten randomly selected titles from the MARC record set will be displayed. Each title will be associated with the date of publication and three links. The first link points to the HathiTrust catalog record where you will be able to read/view the item’s bibliographic data. The second link points to the digitized version of the item complete with its searching/browsing interface. The third and final link queries OCLC for libraries owning the print version of the item. This last link is here to prove that the item is owned by the Hesburgh Libraries.

Finally, why did I create this interface? Because people will want to get a feel for the items in question before the items’ descriptions and/or URLs become integrated into our local system(s). Creating a browsable interface seemed to be one of the easier ways I could accomplish that goal.

Fun with MARC records, the HathiTrust, and application programmer interfaces.

This blog posting describes how I created a set of MARC records representing public domain content that is in both the University of Notre Dame’s collection and the HathiTrust.

Background

In a previous posting I described how I learned about the amount of overlap between my library’s collection and the ‘Trust. There is about a 33% overlap. In other words, about one out of every three books owned by the Hesburgh Libraries has also been digitized and is in the ‘Trust. I wondered how our collections and services could be improved if hypertext links between our catalog and the ‘Trust could be created.

In order to create links between our catalog and the ‘Trust, I needed to identify overlapping titles and remote ‘Trust URLs. Because they wrote the report which started the whole thing, OCLC had to have the necessary information. Consequently I got in touch with the author of the original OCLC report (Constance Malpas), who in turn sent me a list of Notre Dame holdings complete with the most rudimentary of bibliographic data. We then had a conference call with two others: Roy Tennant from OCLC and Lisa Stienbarger from Notre Dame. As a group we discussed the challenges of creating an authoritative overlap list. While we all agreed the creation of links would be beneficial to my local readers, we also agreed to limit what gets linked, specifically to public domain items associated with single digitized items. Links to copyrighted materials were deemed more useless than useful. One can’t download the content, and searching the content is limited. Similarly, any OCLC number (the key I planned to use to identify overlapping materials) can be associated with more than one digitized item. “To which digitized item should I link?” Trying to programmatically disambiguate between one digitized item and another was seen as too difficult to handle at the present time.

The hacking

I then read about the HathiTrust Bib API, and I learned it was simple. Construct a URL denoting the type of control number one wants to search with, as well as whether one wants full or brief output. (Full output is just like brief output except full output includes a stream of MARCXML.) Send the URL off to the ‘Trust and get back a JSON stream of text. The programmer is then expected to read, parse, and analyze the result.
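To make the URL construction concrete, here is a minimal sketch in Python (not the Perl I actually used). The base address and the path layout are assumptions drawn from my reading of the Bib API documentation, and the function name bib_api_url is hypothetical, so check both against the current documentation before relying on them.

```python
from urllib.parse import quote

# Assumed base address of the Bib API; verify against the HathiTrust documentation.
BASE_URL = "https://catalog.hathitrust.org/api/volumes"


def bib_api_url(id_type, id_value, detail="brief"):
    """Build a Bib API query URL for one control number.

    id_type  -- kind of control number, for example "oclc"
    id_value -- the control number itself
    detail   -- "brief" or "full" (full output adds a stream of MARCXML)
    """
    if detail not in ("brief", "full"):
        raise ValueError("detail must be 'brief' or 'full'")
    return f"{BASE_URL}/{detail}/{quote(str(id_type))}/{quote(str(id_value))}.json"


if __name__ == "__main__":
    # 424023 is only a placeholder OCLC number for illustration.
    print(bib_api_url("oclc", 424023))
```

Sending the resulting URL with any HTTP client should return the JSON stream described above, which can then be parsed with a standard JSON library.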

Energized with a self-imposed goal, I ran off to my text editor to hack a program. Given the list of OCLC numbers provided by OCLC, I wrote a Perl program that queries the ‘Trust for a single record. I then made sure the resulting record was: 1) denoted as in the public domain, 2) published prior to 1924, and 3) associated with a single digitized item. When a record matched these criteria I wrote the OCLC number, the title, and the ‘Trust URL pointing to the digitized item to a tab-delimited file. After looping through all the records I identified about 25,000 fitting my criteria. I then wrote another program which looped through the 25,000 items and created a local MARC file describing each item complete with remote HathiTrust URL. (Both of my scripts, filter-pd.pl and get-marcxml.pl, can be used by just about any library. All you need is a list of OCLC numbers.) It is now possible for us here at Notre Dame to pour these MARC records into our catalog or “discovery system”. Doing so is not always straightforward, and I’ll leave that work to others, if they so desire.
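The three tests can be sketched as follows. This is an illustration in Python, not a copy of filter-pd.pl. The field names (records, items, rightsCode, itemURL, titles, publishDates) are my assumptions about the shape of the brief JSON response, and the function name select_link is hypothetical.

```python
import re


def select_link(oclc, response, cutoff_year=1924):
    """Return a tab-delimited line (OCLC number, title, item URL) when the
    parsed Bib API response passes all three tests; otherwise return None.

    The field names used here are assumptions about the JSON layout.
    """
    records = response.get("records", {})
    items = response.get("items", [])

    # Test 3: exactly one bibliographic record tied to exactly one digitized item.
    if len(records) != 1 or len(items) != 1:
        return None
    record = next(iter(records.values()))
    item = items[0]

    # Test 1: the item is denoted as being in the public domain.
    if item.get("rightsCode") != "pd":
        return None

    # Test 2: every four-digit publication year found predates the cutoff.
    years = []
    for date in record.get("publishDates", []):
        match = re.search(r"\d{4}", str(date))
        if match:
            years.append(int(match.group()))
    if not years or max(years) >= cutoff_year:
        return None

    titles = record.get("titles") or [""]
    title = titles[0].replace("\t", " ")  # keep the output strictly tab-delimited
    return "\t".join([str(oclc), title, item.get("itemURL", "")])


if __name__ == "__main__":
    # Entirely made-up sample data, shaped like the assumed response.
    sample = {
        "records": {"r1": {"titles": ["An example title"], "publishDates": ["1854"]}},
        "items": [{"rightsCode": "pd", "itemURL": "https://example.org/item/1"}],
    }
    print(select_link("12345", sample))
```

Records with several digitized items are rejected outright instead of being disambiguated, which mirrors the decision made on the conference call.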

What I learned

This process has been interesting. I learned that a lot of our library’s content exists in digital form, and copyright is getting in the way of making it as useful as it could be. I learned the feasibility of improving our library collections and services by linking between our catalog and remote repositories. The feasibility is high, but the process of implementation is not straightforward. I learned how to programmatically query the HathiTrust. It is simple and easy to use. And I learned how the process of mass digitization has been a boon as well as a bit of a bust, because the result is sometimes ambiguous.

It is now our job as librarians to figure out how to exploit this environment and fulfill our mission at the same time. Hopefully, this posting will help somebody else take the next step.

I have been exploring possibilities of exploiting to a greater degree the content in the HathiTrust. This blog posting outlines some of my initial ideas.

The OCLC Research Library Partnership program recently sent us here at the University of Notre Dame a report describing and illustrating the number and types of materials held by both the University of Notre Dame and the HathiTrust — an overlap report.

As illustrated by the pie chart from the report, approximately 1/3 of our collection is in the HathiTrust. It might be interesting to link our local library catalog records to the records in the ‘Trust. I suppose the people who wrote the original report would be able to supply us with a list of our overlapping titles. Links could be added to our local records facilitating enhanced services to our readers. “Service excellence.”

Percentage of University of Notre Dame and HathiTrust overlap

According to the second chart, of our approximately 1,000,000 overlapping titles, about 121,000 (5%) are in the public domain. The majority of the public domain documents are government documents. On the other hand about 55,000 of our overlapping titles are both in the public domain and a part of our collection’s strengths (literature, philosophy, and history). It might be interesting to mirror any or all of these public domain documents locally. This would enable us to enhance our local collections and possibly provide services (text mining, printing, etc.) against them. “Lots of copies keep stuff safe.”

Subject coverage of the overlapping materials

According to the HathiTrust website, about 250,000 items in the ‘Trust are freely available via the public domain. For example, somebody has created a collection of public domain titles called English Short Title Catalog, which is apparently the basis of EEBO and in the public domain. [2] Maybe we could query the ‘Trust for public domain items of interest, and mirror them locally too? Maybe we could “simply” add those public domain records to our catalog? The same process could be applied to collections from the Internet Archive.

The primary purpose of the HathiTrust is to archive digitized items for its membership. A secondary purpose is to provide some public access to the materials. After a bit of creative thinking on our parts, I believe it is possible to extend the definition of public access and provide enhanced services against some of the content in the archive as well as fulfill our mission as a research library.

I think I will spend some time trying to get a better idea of exactly what public domain titles are in our collection as well as in the HathiTrust. Wish me luck.