Monthly Archives: December 2012

While playing with OpenOffice in my research for Files that Last, I came across a preservation risk. I copied an image from a website and pasted it into a text document, then looked at the resulting XML. The image data wasn’t anywhere in content.xml or anywhere else in the overall ZIP document. Instead, I found only a reference to the image’s URL.

The source for the image is on the Web. This means that if the URL stops working, the document loses the image. That’s a poor plan for long-term storage.

The way to avoid this is to use Edit > Paste special and paste the image as a bitmap. It can be a pain to remember to do this. You may be able to catch images that are pasted by reference, since there can be a brief delay while just a box with the URL is displayed before the image comes up.
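If you’d rather check a finished document than watch for the delay, a short script can do it. This is only a sketch: it assumes the document is an ordinary ODF package, that images appear as draw:image elements with an xlink:href attribute in content.xml, and that embedded images are stored as entries inside the ZIP. The function name find_linked_images is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard ODF namespace URIs for drawing elements and XLink attributes.
DRAW_NS = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
XLINK_NS = "http://www.w3.org/1999/xlink"


def find_linked_images(odf_path):
    """Return the hrefs of images that are not stored inside the ODF package."""
    with zipfile.ZipFile(odf_path) as package:
        members = set(package.namelist())
        root = ET.fromstring(package.read("content.xml"))

    linked = []
    for image in root.iter("{%s}image" % DRAW_NS):
        href = image.get("{%s}href" % XLINK_NS)
        # An embedded image points at an entry in the ZIP (e.g. Pictures/x.png);
        # anything else (a web URL, a path outside the package) is a link.
        if href and href.lstrip("./") not in members:
            linked.append(href)
    return linked
```

Calling find_linked_images on a document returns an empty list if every image is embedded, and otherwise the list of references that will break if their targets go away.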

Sneaky little preservation hazards like this (and the one I mentioned earlier with Adobe Illustrator files) are the kind of thing you’ll read about in Files that Last when it comes out.

As a practice run for publishing Files that Last on Smashwords, I’ve put together a small but hopefully useful e-booklet, JHOVE Tips for Developers, which I’m planning to put up there on a “choose your own price” basis. This will help me work out the process of creating the book on a small scale, and maybe it will buy me a Whopper and fries.

For a book of this sort I obviously can’t afford paid proofreading, but I’m hoping one or two people might look it over before I submit the book. You can get the draft as a PDF here.

I’d offer you a free copy in return, but you can get that anyway. What I can do is offer people who give useful feedback credit in the book, as well as my personal thanks.

Yesterday I was doing some experiments with Adobe Illustrator. According to some web sites, the CS5 version saves its files as PDF, though with the extension .AI. When you save a file, though, the options dialog has a checkbox labeled “Create PDF Compatible File.” I unchecked it and saved the file, then opened it in JHOVE. JHOVE says it’s perfectly good PDF, indeed PDF/A. Then I tried opening it in Preview, and this is what it looked like:

If you don’t actually look at the file but trust the mere fact that it’s a PDF, you might put it into a repository and find out later on that it’s worthless as a PDF. What’s happening is that PDF can embed any kind of content, and this one embeds its native PGF data. Any PDF reader can open the file, but only an application that understands PGF can use its actual content. Anyone putting PDF into a repository should be aware of this risk.

It’s outside the scope of JHOVE to check whether embedded content is acceptable to PDF/A, so the claim that it’s correct PDF/A is probably spurious. It is, however, definitely legal PDF.
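Since a validator won’t catch this, a crude pre-screening step might help flag files for a human to open and look at. The sketch below is a heuristic under a loud assumption: that Illustrator’s native data shows up in the PDF under keys whose names begin with /AIPrivateData. That is my understanding of what such files commonly contain, not something taken from a specification, and the scan will miss keys hidden inside compressed object streams. A match doesn’t tell you whether the PDF-compatible content is present, only that the file came from Illustrator and deserves a visual check.

```python
def looks_like_illustrator_pdf(data: bytes) -> bool:
    """Heuristic: True if the bytes look like a PDF carrying Illustrator data.

    Assumption (not from any spec): Illustrator stores its native data
    under keys beginning with /AIPrivateData.
    """
    # PDF readers conventionally accept the header within the first 1024 bytes.
    is_pdf = b"%PDF-" in data[:1024]
    return is_pdf and b"/AIPrivateData" in data
```

To use it on a file, read the file in binary mode and pass the bytes in; anything that returns True goes on the “open it and look” pile before it goes into the repository.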

I’ve put up JHOVE 1.9 on the SourceForge site today. I think it’s the
least buggy version ever. Please let me know if I’m wrong.

Release notes:

GENERAL

Jhove.java and JhoveView.java now get their version information from
JhoveBase.java. Before it was redundantly kept in three places, and
sometimes they didn’t all get updated for a new release. Like in 1.8.

ConfigWriter was in the package edu.harvard.hul.ois.jhove.viewer, which
caused a NoClassDefFoundError if non-GUI configurations didn’t include
JhoveViewer.jar in the classpath. It’s been moved to
edu.harvard.hul.ois.jhove.

Added script packagejhove.sh and made md5.pl part of the CVS repository
to make packaging for delivery easier.

jhove.bat now simply uses the Java command rather than requiring
the user to set up the Java path.

JhoveView.jar and jhove (the top level shell script) are now forced
by ant to be executable so there are no mistakes.

Configuration file code for adding handlers and giving init strings
to modules was an awful mess that never could have worked. Major repairs done.

AIFF MODULE

If an AIFF file was found to be little-endian, the module instance
would stay in little-endian mode for all subsequent files. This
has been fixed.
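For anyone curious how a bug like this happens, here is a toy illustration in Python. It is not JHOVE’s code (JHOVE is Java) and the file format is invented: the first byte is “L” for little-endian or anything else for big-endian, followed by a four-byte integer. The class name ToyModule and the reset_per_file switch exist only for the example. The point is that a module instance reused across files has to reset its per-file state.

```python
import struct


class ToyModule:
    """Hypothetical parser that is reused for many files, like a module instance."""

    def __init__(self, reset_per_file=True):
        self.reset_per_file = reset_per_file
        self.little_endian = False

    def parse(self, data: bytes) -> int:
        if self.reset_per_file:
            # The fix: forget whatever the previous file told us.
            self.little_endian = False
        if data[:1] == b"L":
            self.little_endian = True
        fmt = "<I" if self.little_endian else ">I"
        return struct.unpack(fmt, data[1:5])[0]


buggy = ToyModule(reset_per_file=False)
buggy.parse(b"L\x01\x00\x00\x00")          # little-endian file, value 1
print(buggy.parse(b"B\x00\x00\x00\x01"))   # big-endian file misread: 16777216

fixed = ToyModule()
fixed.parse(b"L\x01\x00\x00\x00")
print(fixed.parse(b"B\x00\x00\x00\x01"))   # 1
```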

TIFF MODULE

TIFF files that had strip or tile offsets but no corresponding byte
counts were throwing an exception all the way to the top level. Now
they’re correctly being reported as invalid.

XML MODULE

Cleaned up reporting of schemas. Added some small classes to replace
the use of string arrays for information structures. Made URI comparison
for local schema parameter case-independent. Resolved conflict between
“s” and “schema” parameters.

WAVE MODULE

Some uncaught exceptions caused the module to throw all the way
back to JhoveBase and not report any result for certain defective
files. These now report the file as not well-formed.
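The general pattern behind this fix (and the TIFF one above) can be sketched as follows. This is a simplified Python illustration, not the module’s actual Java code: it checks only the 12-byte RIFF/WAVE header, and the Result class and check_riff_header function are names invented for the example. Parsing errors are caught inside the check and turned into a “not well-formed” result instead of escaping to the caller.

```python
import struct
from dataclasses import dataclass, field


@dataclass
class Result:
    well_formed: bool = True
    messages: list = field(default_factory=list)


def check_riff_header(data: bytes) -> Result:
    """Check the RIFF/WAVE header, reporting defects instead of raising."""
    result = Result()
    try:
        # RIFF header: "RIFF", little-endian 32-bit size, form type "WAVE".
        chunk_id, size, form = struct.unpack("<4sI4s", data[:12])
        if chunk_id != b"RIFF" or form != b"WAVE":
            raise ValueError("not a RIFF/WAVE header")
        # The size field counts everything after the first 8 bytes.
        if size + 8 > len(data):
            raise ValueError("declared size exceeds file length")
    except (struct.error, ValueError) as exc:
        # A defective file is a finding to report, not a crash.
        result.well_formed = False
        result.messages.append(str(exc))
    return result
```

A truncated file makes struct.unpack raise, and that exception is what used to travel all the way up; here it just becomes a message on the result.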

My daily update on the Files that Last blog includes a new song about digital preservation. It’s to promote my Kickstarter campaign for Files that Last and shares the book’s title, but you might find it fun in its own right. Naturally there’s a WAVE file in addition to the MP3. Links are appreciated.

It’s started! Today I’m launching a Kickstarter campaign to help fund the completion and publication of my e-book, Files that Last. Rather than repeat everything I’ve said on the Kickstarter page and the homepage for the book, I’ll say just enough to convince you, as someone who cares about formats and digital preservation, that it’s worth looking at those pages and considering helping to fund the book and spread the word.

So far there isn’t, as far as I know, a book to promote and explain digital preservation to people who understand computers but aren’t part of the library and archiving world. That’s where I’m aiming this book. If you look at the Library of Congress’s personal archiving pages, that gives you some idea of what I’m aiming at, though I’m also addressing nonprofit organizations and businesses. It’s not a book for programmers, but it will have enough technical detail to give an understanding of how formats, metadata, and media affect the longevity of files and how to make best use of them.

If you pledge $10, you’ll get an electronic copy of the book when it’s done (DRM-free, naturally). For just $100, you can use it as a classroom text and distribute it to up to 50 students!

Lately I’ve been writing a user guide for JHOVE as part of an upcoming
book. This means going through all the features to see how they really
work, and this has turned up a number of bugs. Among the latest fixes
are: (1) If the AIFF module encountered a little-endian file, it
treated all subsequent files as little-endian whether they were or not.
(2) Certain errors in WAVE files threw an exception from the module
instead of reporting that the file wasn’t well-formed. (3) The XML
module’s “s” and “schema” parameters conflicted, with “schema” being
treated as both, and there was a problem with schema URIs containing
upper-case characters.

Version 1.9b3 should fix all of these. Hopefully I won’t find anything
else that needs fixing soon, so we can finally have a 1.9 release, but
if there are any problems with this beta, please let me know!

JHOVE 1.9b2 is up, fixing issues with the configuration file. The code for editing the configuration file from the GUI was just completely broken, but I think it’s fixed now. I can’t imagine anyone was ever trying to add init strings to modules (none of the standard ones use one anyway) or add handlers using the GUI, or someone would already have noticed. But I couldn’t stand having it not fixed, so the new build is there.

JHOVE 1.9b1 is now up on SourceForge. The only significant difference is that jhove.bat for Windows now uses the “java” command rather than forcing the user to figure out where Java lives. The *nix script already does this.

I’d like to put up a final 1.9 pretty quickly, so let me know if anything is wrong as soon as you can.