Digital Preservation Webinar Q&A

The questions below were submitted during the NISO webinar "What It Takes To Make It Last: E-Resources Preservation," held February 10, 2010. Not all questions could be answered during the live webinar, so those that could not be addressed at the time are also included below.

Feel free to contact us if you have any additional questions about library, publishing, and technical services standards or standards development, or if you have suggestions for new standards, recommended practices, or areas where NISO should be engaged.

What It Takes To Make It Last: E-Resources Preservation
Webinar Questions & Answers
February 10, 2010

Are tools for batch conversions of PDF to PDF/A available yet? If so, what are the best batch conversion tools for PDF to PDF/A? What is lost in the conversion?

The web site http://pdfa.org/doku.php?id=pdfa:en:products:convertpdf lists several tools. One issue with PDF to PDF/A conversion tools is that different tools may produce slightly different PDF/A documents, and their validators may be implemented slightly differently. Bavaria did an evaluation of PDF/A validators against the published test suites, and in their report 3 Height and pdfapilot were rated higher than other tools.

Adobe Acrobat supports PDF to PDF/A conversion, but its support for batch processing is limited. For batch processing, I found 3 Height more suitable for the job. It supports Linux and has a command-line converter that can be configured to run a post-conversion PDF/A validation. It can also be configured to embed un-embedded fonts and ICC color profiles, which reduces the failure rate. After working through issues with 3 Height developers/customer service, I was able to convert about 90% of the 100+ PDFs that I sampled from our collection of electronic theses and dissertations (ETDs). pdfapilot also offers PDF to PDF/A conversion, but I did not evaluate it.

I don't recall much being lost in conversion, apart from video that was converted to still images.
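
For readers planning a similar batch run, the workflow described above (convert every PDF in a folder, then count how many succeeded) can be sketched in Python. This is only an illustration under stated assumptions, not the procedure actually used in the answer: the function name batch_convert and the command_template argument are hypothetical, and the template is a placeholder for whatever command-line converter you have installed. No real converter's options are shown, so consult your tool's documentation for its actual arguments.

```python
import subprocess
from pathlib import Path


def batch_convert(src_dir, dst_dir, command_template):
    """Run a converter command on every PDF in src_dir and tally the results.

    command_template is a list of arguments in which "{src}" and "{dst}"
    are replaced by the input and output paths. It is a placeholder
    (an assumption of this sketch) for an installed command-line converter.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    converted, failed = [], []
    for src in sorted(src_dir.glob("*.pdf")):
        dst = dst_dir / src.name
        # Fill the input and output paths into the command for this file.
        cmd = [arg.format(src=src, dst=dst) for arg in command_template]
        result = subprocess.run(cmd, capture_output=True, text=True)
        # Treat a zero exit code plus an existing output file as success.
        if result.returncode == 0 and dst.exists():
            converted.append(src.name)
        else:
            failed.append(src.name)
    return converted, failed
```

Dividing the length of the first returned list by the total number of files gives a success rate comparable to the roughly 90% figure mentioned above. The list of failed file names is the set to investigate by hand, for example for un-embedded fonts or missing color profiles.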

Has the Center for Research Libraries certified any organizations or archives engaged in digital preservation?

However, please note that CRL is not a certified certifying authority, if you understand what I'm saying. CRL took on this role independently, and only claims authority for its certification within the CRL community.

For Jeremy: How many TB do you manage in HathiTrust?

Answer (Jeremy York): As of February 10, there were just shy of 200 TB of data at each of HathiTrust’s two active storage locations (in Michigan and Indiana).

Has thought been given to replacing MDF with the SHA-1 and SHA-2 hash algorithms?

Answer (Jeremy York): (I believe the question was referring to MD5). No, not at this time.
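
For context, MD5, SHA-1, and the SHA-2 family (for example SHA-256) are all checksum algorithms that a repository can use for fixity checking, that is, confirming that a stored file has not changed. The sketch below uses Python's standard hashlib module to compute several digests in one pass over a file and to verify a file against a recorded value. It is a generic illustration, not HathiTrust's implementation; the function names file_digests and verify_fixity are made up for this example.

```python
import hashlib


def file_digests(path, algorithms=("md5", "sha1", "sha256"), chunk_size=1 << 20):
    """Return {algorithm: hex digest} for a file, reading it once in chunks."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def verify_fixity(path, expected_hex, algorithm="sha256"):
    """Return True if the file's current digest matches the recorded one."""
    current = file_digests(path, (algorithm,))[algorithm]
    return current == expected_hex.lower()
```

Because all digests are computed from the same read of the file, a repository could record a SHA-256 value alongside an existing MD5 value without reading its content twice.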

Do I understand correctly that HathiTrust's future financial model will charge more for libraries that have weeded their print collections?

Answer (Jeremy York): No, the primary criterion is holdings and overlap with the repository. The new model is designed to reflect the benefit that partner institutions receive from digital volumes preserved in HathiTrust. These include preservation and access services, as well as specialized services for users with print disabilities and section 108 uses of materials (access to digital copies of volumes that meet section 108 criteria on library premises, replacement print copies, etc.). Institutions that wish to receive these benefits for volumes they hold, or have held, in their collections would share in the cost of curating and preserving the digital copies in HathiTrust. Though many details need to be determined, it's likely that if an institution did not wish to receive services for certain volumes in its collection, it would be able to opt these volumes out and not share in the costs of maintaining the digital copies.

Is there an average number or percentage of volumes across institutions within HathiTrust that have been digitized by Google?

Answer (Jeremy York): The overwhelming majority of volumes in the repository have been digitized by Google (approximately 0.1% is made up of content digitized locally at the University of Michigan). The number of non-Google volumes will increase over time as we begin ingest of partner materials digitized by the Internet Archive and through local digitization projects. We will be devoting significant resources to the ingest of non-Google books and journals in the very near future.

Could you suggest to me workshops, reading, webinars that target institutions who have not even begun to consider this issue yet--where to start. I actually understood the content of this webinar, but still don't have the slightest idea about what a single small institution needs to do to start collecting electronic information and preserving it properly.

Answer (Priscilla Caplan): Very good question. Good, general, action-oriented introductory materials are hard to find, and there isn't a comprehensive program of preservation education and training in the U.S.

In 2007 and 2008, the Northeast Document Conservation Center (NEDCC) held a set of regional workshops called "Stewardship of Digital Assets". Although the workshops are over, the tools developed for the program, including an extensive resource list, are available at http://www.nedcc.org/resources/sodatools.php.

My personal belief is that all institutions can contribute to the curation of sustainable digital objects, but long-term preservation is best handled by central consortial or third-party institutions. BCR (The Bibliographical Center for Research) is holding a series of IMLS-funded workshops on Digital Preservation for Digital Collaboratives in 2010 (http://www.bcr.org/dps/training/neh-dpdc.html). If your institution belongs to a library network or any other type of collaborative, this would be well worth considering.