Format identification is the process of determining the format to
which a digital object conforms; in other words, it answers the question:
"I have a digital object; what format is it?"

Format validation is the process of determining the level of
compliance of a digital object to the specification for its purported format,
e.g.:
"I have an object purportedly of format F; is it?"

Format validation conformance is determined at three levels:
well-formedness, validity, and consistency.

A digital object is well-formed if it meets the purely
syntactic requirements for its format.

An object is valid if it is well-formed and it meets the
higher-level semantic requirements for format validity.

An object is consistent if it is valid and its internally
extracted representation information is consistent with externally supplied
representation information.

For example, a TIFF object is
well-formed if it starts with an 8-byte header followed by a sequence of
Image File Directories (IFDs), each composed of a 2-byte entry count and
a series of 12-byte tagged entries.
The object is valid if it meets certain additional
semantic-level rules, such as that an RGB file must have at least three sample
values per pixel.
The object is consistent with external
NISO Z39.87 metadata if that
metadata is consistent with the representation information of the object
that is extracted by JHOVE.
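
The syntactic part of the TIFF example can be illustrated with a minimal sketch. It is not the code of JHOVE's TIFF module; the class and method names are hypothetical. It relies only on the TIFF 6.0 layout: an 8-byte header (byte-order mark, the number 42, offset of the first IFD), then IFDs made of a 2-byte entry count, 12-byte entries, and a 4-byte offset to the next IFD. It performs no semantic (validity) checks.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of the JHOVE API.
final class TiffSyntaxSketch {

    /** Returns the number of IFDs in the chain, or -1 if the layout is broken. */
    static int countIfds(byte[] data) {
        if (data.length < 8) {
            return -1;                       // too short for the 8-byte header
        }
        ByteOrder order;
        if (data[0] == 'I' && data[1] == 'I') {
            order = ByteOrder.LITTLE_ENDIAN; // "II": little-endian
        } else if (data[0] == 'M' && data[1] == 'M') {
            order = ByteOrder.BIG_ENDIAN;    // "MM": big-endian
        } else {
            return -1;
        }
        ByteBuffer buf = ByteBuffer.wrap(data).order(order);
        if (buf.getShort(2) != 42) {
            return -1;                       // TIFF magic number
        }
        long offset = buf.getInt(4) & 0xFFFFFFFFL; // offset of the first IFD
        int count = 0;
        while (offset != 0) {
            if (offset + 2 > data.length) {
                return -1;                   // entry count lies outside the data
            }
            int entries = buf.getShort((int) offset) & 0xFFFF;
            long next = offset + 2 + 12L * entries; // skip the 12-byte entries
            if (next + 4 > data.length) {
                return -1;                   // truncated IFD
            }
            offset = buf.getInt((int) next) & 0xFFFFFFFFL;
            if (++count > 65535) {
                return -1;                   // crude guard against cyclic chains
            }
        }
        return count > 0 ? count : -1;       // at least one IFD is required
    }
}
```

A full well-formedness check would also inspect each entry's tag, type, and value offset; the sketch stops at the directory chain.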

The concept of distinguishing between well-formedness (syntactic correctness)
and validity (semantic correctness) was
taken from XML.

Format characterization is the process of determining the
format-specific significant properties of an object of a given format, e.g.:
"I have an object of format F; what are its salient properties?"

The set of characteristics reported by JHOVE about a digital object
is known as the object's representation information,
a concept introduced by the Open Archival Information System (OAIS)
reference model [ISO/IEC 14721].
The standard representation information reported by JHOVE includes:
file pathname or URI,
last modification date, byte size, format, format version,
MIME type, format profiles, and optionally,
CRC32, MD5, and SHA-1 checksums
[CRC32,
MD5,
SHA-1].
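
As an illustration of the optional checksums, the following sketch computes CRC32, MD5, and SHA-1 values with the standard Java library. It is written for a current Java release rather than J2SE 1.4, and the class and method names are hypothetical; it does not show how JHOVE itself computes or formats these values.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

// Hypothetical helper, not part of the JHOVE API.
final class ChecksumSketch {

    /** CRC32 of the data as eight lowercase hex digits. */
    static String crc32Hex(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return String.format("%08x", crc.getValue());
    }

    /** Message digest ("MD5" or "SHA-1") of the data as lowercase hex. */
    static String digestHex(String algorithm, byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance(algorithm).digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xFF));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 and SHA-1 are required of every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }
}
```

For the three ASCII bytes "abc", crc32Hex returns "352441c2", and digestHex returns "900150983cd24fb0d6963f7d28e17f72" for "MD5" and "a9993e364706816aba3e25717850c26c9cd0d89d" for "SHA-1".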

Identification, validation, and characterization actions are frequently
necessary during routine operation of
digital repositories and for digital preservation activities.

The output from JHOVE is controlled by output handlers.
JHOVE uses an extensible plug-in architecture; it can be configured at the
time of its invocation to include whichever format modules and
output handlers are desired.
The initial release of JHOVE includes modules for
arbitrary byte streams;
ASCII and UTF-8 encoded text;
TIFF, HTML, XML, JPEG, JPEG2000, and PDF;
and AIFF and WAVE audio.
It also includes text and XML output handlers.

JHOVE is written in Java to conform to
Java 2 Platform, Standard Edition
(J2SE) 1.4.
A J2SE 1.4-compliant Java Runtime Environment (JRE) is required for
proper operation of JHOVE.
(JHOVE was originally implemented using the Sun J2SE SDK 1.4.1, but has
also been tested to run properly under
Sun J2SE SDK 1.4.2.)
JHOVE should be usable on any Unix, Windows, or OS X platform with the
appropriate J2SE installation.

If you would like to recompile the JHOVE source code, then
Apache Ant is necessary.
Note that the JAVA_HOME environment variable must be appropriately
assigned for Ant to function properly.
(JHOVE was implemented and tested using
Ant 1.5.1.)

JHOVE Release 1.0 is distributed as gzip'ed tar files or ZIP files
available from the
download page.
When uncompressed and disaggregated, these result in the following
installation directory structure:

The following jar files are meant to be used for embedding JHOVE functionality
into new applications or systems:

jhove.jar

Contains the JHOVE API interfaces and classes

jhove-handler.jar

Contains the standard JHOVE output handlers

jhove-module.jar

Contains the standard JHOVE modules

The following jar file is meant to be used with the stand-alone JHOVE
application using a command-line interface. It contains the main
Jhove class and the contents of jhove.jar,
jhove-handler.jar, and jhove-module.jar:

JhoveApp.jar

The following jar file is meant to be used with the stand-alone JHOVE
application using a
Swing GUI interface.
It contains the main
JhoveView class and the contents of jhove.jar,
jhove-handler.jar, and jhove-module.jar:

For proper operation, the configuration file, jhove/conf/jhove.conf,
must be edited so that the <jhoveHome> element points to the absolute
pathname of the JHOVE installation, or home, directory and the
<tempDirectory> element points to the directory in which temporary
files are created:

The JHOVE home directory is the top-most directory in the distribution TAR
or ZIP file.
On Unix systems, /var/tmp is an appropriate temporary directory; on
Windows, C:\Temp.
For example, if the distribution TAR
file is disaggregated on a Unix system in the directory
"/users/stephen/projects", then the configuration file should read:

In the JHOVE home directory, edit the JHOVE Bourne shell driver script,
"jhove" (or the equivalent DOS shell script, "jhove.bat"),
and set the JHOVE home directory, Java home directory,
and Java interpreter:

where JHOVE_HOME is set to specify the absolute pathname of the
JHOVE home directory;
JAVA_HOME is set to specify the absolute pathname of the
Java home directory; and
JAVA is set to specify the absolute pathname of the
Java interpreter.
For example:

In the DOS shell driver script, "jhove.bat", the equivalent three
variables are:

SET JHOVE_HOME=jhove-home-directory
SET JAVA_HOME=java-home-directory
SET JAVA=%JAVA_HOME%\bin\java

For example:

SET JHOVE_HOME="C:\Program Files\jhove"
SET JAVA_HOME="C:\Program Files\java\j2re1.4.1_02"
SET JAVA=%JAVA_HOME%\bin\java

The quotation marks are necessary because of the embedded space characters.
On Windows platforms it may also be necessary to add the Java bin subdirectory
to the System PATH environment variable:

PATH=C:\Program Files\java\j2re1.4.1_02\bin;...

Specific instructions on installing JHOVE in a Windows XP environment
are available.
For additional information on setting a Windows environment variable,
consult your local documentation or system administrator.

At the time of its invocation,
JHOVE performs dynamic configuration of its modules and output handlers
based on an XML-formatted configuration file.
The configuration file is specified by the first valid value defined as:

The -c config command line argument (only for the command-line interface);

The file ${user.home}/jhove/conf/jhove.conf,
where ${user.home}
is the standard Java user.home property; or

The edu.harvard.hul.ois.jhove.config property in
the properties file ${user.home}/jhove/jhove.properties.

Note that the GUI interface only searches for the configuration file at
the second and third locations listed above;
it does not make use of the -c config option.
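
The "first valid value" rule can be sketched as follows. This is an illustration rather than JHOVE's own code: the class and method names are hypothetical, and what counts as "valid" is left to a caller-supplied test (for example, that the file exists and is readable), since that criterion is not spelled out here. The Properties argument stands for the already loaded contents of ${user.home}/jhove/jhove.properties.

```java
import java.util.Properties;
import java.util.function.Predicate;

// Hypothetical helper, not part of the JHOVE API.
final class ConfigLocatorSketch {

    static final String CONFIG_PROPERTY = "edu.harvard.hul.ois.jhove.config";

    /**
     * Returns the first usable configuration file pathname, or null.
     * Pass cliConfig as null for the GUI, which has no -c option.
     */
    static String locate(String cliConfig, String userHome,
                         Properties props, Predicate<String> usable) {
        String[] candidates = {
            cliConfig,                               // 1. -c config argument
            userHome + "/jhove/conf/jhove.conf",     // 2. default location
            props.getProperty(CONFIG_PROPERTY)       // 3. properties file entry
        };
        for (String candidate : candidates) {
            if (candidate != null && usable.test(candidate)) {
                return candidate;
            }
        }
        return null;                                 // no configuration found
    }
}
```

A caller might pass System.getProperty("user.home") for userHome and a predicate built on java.io.File.canRead().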

All format modules and output handlers must be specified in the
XML-formatted configuration file, validatable against the
XML Schema
<http://hul.harvard.edu/ois/xml/xsd/jhove/jhoveConfig.xsd>.
(In the following display, brackets [ and ] enclose optional configuration
file elements.)

The optional <defaultEncoding> element specifies the default character
encoding used by output handlers.
This option can also be specified by the -e encoding command line argument.
The default output encoding is UTF-8.

The optional <tempDirectory> element specifies the pathname of
the directory in which temporary files are created.
This option can also be specified by the -t directory command line argument.
On most Unix systems, a reasonable temporary directory is "/var/tmp";
on Windows, "C:\temp".

The optional <bufferSize> element specifies the buffer size
used for buffered I/O.
This option can also be specified by the -b buffer command line argument.

The optional <logLevel> element specifies the
logging level used by calls to the logging API.
This option can also be specified by the -l log-level command line argument. The
default is SEVERE.

All class names must be fully qualified with their package name, for example:

The order in which format modules are defined is important; when performing
a format identification operation, JHOVE will search for a matching module in
the order in which the modules are defined in the configuration file.
In general, the modules for more generic formats should come later in the list.
For example, the standard module ASCII should be
defined before the UTF-8 module, since all ASCII
objects are, by definition, UTF-8 objects, but not vice versa.
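
The effect of module order can be shown with a small sketch of first-match identification. It is not JHOVE's implementation: the class and method names are hypothetical, the ASCII test is simplified to "every byte is below 0x80" (the real module may apply further rules), and the fallback label is an assumption standing in for the generic byte stream module.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical helper, not part of the JHOVE API.
final class ModuleOrderSketch {

    /** Simplified ASCII test: every byte is in the range 0x00-0x7F. */
    static boolean isAscii(byte[] data) {
        for (byte b : data) {
            if (b < 0) {
                return false;
            }
        }
        return true;
    }

    /** Strict UTF-8 test: decoding must succeed without replacement. */
    static boolean isUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Returns the name of the first module, in map order, that accepts the data. */
    static String identify(Map<String, Predicate<byte[]>> modules, byte[] data) {
        for (Map.Entry<String, Predicate<byte[]>> entry : modules.entrySet()) {
            if (entry.getValue().test(data)) {
                return entry.getKey();
            }
        }
        return "BYTESTREAM";                 // assumed label for the generic fallback
    }
}
```

With a LinkedHashMap that lists "ASCII-hul" before "UTF8-hul", the bytes of "hello" are reported as ASCII; with the order reversed, the same bytes are reported as UTF-8, because the more generic module matches first.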

The optional <init> element is used to pass a module-specific
argument to a module at the time it is first instantiated within JHOVE.
See the details for the individual modules to see if such an argument
is defined. The use of the <init> argument is currently not
defined for any of the standard JHOVE modules.

The optional and repeatable <param> element is used to pass a
module-specific parameter to a module immediately prior to each invocation of
the module's parse() method.
See the details for the individual modules to see if such a parameter
is defined.

In addition to the modules and output handlers specified in the
configuration file, JHOVE is also always statically linked with
the standard Bytestream module and Text and XML output handlers.

The JHOVE command-line interface
is invoked by the Bourne shell script "jhove" (under Unix)
or the DOS shell script "jhove.bat" (under Windows) in
the JHOVE installation directory.
This script properly sets the Java CLASSPATH and executes the
Jhove class with the Java interpreter.

In the invocation syntax below,
brackets [ and ] enclose optional arguments.
In addition to the syntax specified in subsequent sections,
any of the following standard options can also be used:

The following syntax is used to discover, or identify, the format of
a digital object.

jhove ... [-ks] file-or-uri1 .. file-or-uriN

where the first ellipsis ... is a placeholder for any of the optional standard
options defined above.

The digital object(s) can be specified as a file or
directory pathname or as a URI. If a directory is specified,
JHOVE will recursively walk through the directory.
The optional -s flag specifies that the identification should be
performed solely on the basis of the internal signatures (e.g., magic numbers)
associated with the formats, rather than by a complete parsing of the object.
After the object's format has been identified, its representation information
is displayed.
The optional -k flag specifies that object checksum values should be
calculated and displayed as part of the representation information.
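
Signature-based identification, as selected by the -s flag, can be illustrated with a sketch that compares the leading bytes of an object against well-known magic numbers. The class name, method names, and returned labels are hypothetical, and the signature table is a small illustrative subset rather than the set used by JHOVE's modules.

```java
// Hypothetical helper, not part of the JHOVE API.
final class SignatureSketch {

    /** True if the data contains the given byte values at the given offset. */
    private static boolean matches(byte[] data, int offset, int... signature) {
        if (data.length < offset + signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((data[offset + i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns a format label based only on internal signatures. */
    static String identify(byte[] d) {
        if (matches(d, 0, 'I', 'I', 42, 0) || matches(d, 0, 'M', 'M', 0, 42)) {
            return "TIFF";
        }
        if (matches(d, 0, '%', 'P', 'D', 'F', '-')) {
            return "PDF";
        }
        if (matches(d, 0, 0xFF, 0xD8, 0xFF)) {
            return "JPEG";
        }
        if (matches(d, 0, 0x00, 0x00, 0x00, 0x0C, 0x6A, 0x50, 0x20, 0x20,
                    0x0D, 0x0A, 0x87, 0x0A)) {
            return "JPEG2000";               // JP2 signature box
        }
        if (matches(d, 0, 'R', 'I', 'F', 'F') && matches(d, 8, 'W', 'A', 'V', 'E')) {
            return "WAVE";
        }
        if (matches(d, 0, 'F', 'O', 'R', 'M') && matches(d, 8, 'A', 'I', 'F', 'F')) {
            return "AIFF";
        }
        return "unknown";
    }
}
```

A signature match is only a hint: an object can begin with "%PDF-" and still fail a complete parse, which is why identification without -s examines the whole object.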

The following syntax is used to determine the validity of a digital
object with respect to a particular format, and to display format-specific
representation information.

jhove ... -m module [-kr] file-or-uri

where the ellipsis ... is a placeholder for any of the optional standard
options defined above.

Many formats use numeric flags to specify format properties.
By default, JHOVE will translate these numeric values into
descriptive strings.
For example, the TIFF compression value 2 corresponds to "CCITT Group 3 RLE".
The optional -r flag specifies that the "raw" data values
should be displayed, not the text labels.
The optional -k flag specifies that object checksum values should be
calculated and displayed as part of the representation information.
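
The difference between labelled and raw output can be sketched as a simple lookup. The class and method names are hypothetical; only the label for value 2 is taken from the text above, and the other labels are illustrative wording for values defined by TIFF 6.0, not necessarily the strings JHOVE prints.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the JHOVE API.
final class RawFlagSketch {

    private static final Map<Integer, String> COMPRESSION = new HashMap<>();
    static {
        COMPRESSION.put(1, "uncompressed");       // illustrative label
        COMPRESSION.put(2, "CCITT Group 3 RLE");  // label quoted in the text
        COMPRESSION.put(5, "LZW");                // illustrative label
        COMPRESSION.put(32773, "PackBits");       // illustrative label
    }

    /** Returns the raw number when raw is true, otherwise a label if one is known. */
    static String display(int value, boolean raw) {
        if (raw) {
            return Integer.toString(value);
        }
        String label = COMPRESSION.get(value);
        return label != null ? label : Integer.toString(value);
    }
}
```

Here display(2, false) yields the text label and display(2, true) yields "2", mirroring the effect of the -r flag.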

The class file implementing the named module must be found on the Java
CLASSPATH at the time of invocation.
Note that JHOVE recognizes module names in a case-insensitive manner:
"ASCII-hul" and "ascii-hul" both specify the standard
ASCII module.
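
One common way to obtain case-insensitive name matching in Java is a map ordered by String.CASE_INSENSITIVE_ORDER. The sketch below illustrates the behaviour described above; it is not JHOVE's implementation, and the class name, method names, and the registered class name "example.AsciiModule" are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the JHOVE API.
final class ModuleRegistrySketch {

    // Keys compare without regard to case, so "ASCII-hul" and "ascii-hul" collide.
    private final Map<String, String> classByName =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    /** Registers a module name together with the class that implements it. */
    void register(String moduleName, String className) {
        classByName.put(moduleName, className);
    }

    /** Returns the implementing class name, or null if the module is unknown. */
    String lookup(String moduleName) {
        return classByName.get(moduleName);
    }
}
```

After register("ASCII-hul", "example.AsciiModule"), both lookup("ASCII-hul") and lookup("ascii-hul") return the same class name.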

The following syntax options display descriptive information about
various components of JHOVE.

jhove ...
jhove ... -m module
jhove ... -H output-handler

where the ellipsis ... is a placeholder for any of the optional standard
options defined above.

The first invocation option will display descriptive information about
JHOVE itself,
including a list of all loaded modules and output handlers.
The second option will display descriptive information about the named
module.
The third option will display descriptive information about the named
output handler.

The class file implementing the named module or output handler must be found
on the Java CLASSPATH at the time of invocation.
Note that JHOVE recognizes module and
output handler names in a case-insensitive manner:
"ASCII-hul" and "ascii-hul" both specify the standard
ASCII module.

AIFF representation information is formatted by the
output handlers
consistent with the proposed AES-X098B, Core audio metadata XML
definition, currently under development by the
Audio Engineering Society (AES)
SC-03-06
Working Group on Digital Library and Archive Systems.
Additional representation information includes the audio technical properties
of all chunks.

The JPEG2000-hul module is invoked by the
following command line option:

jhove ... -m jpeg2000-hul ...

(Recall that JHOVE module names can be specified in a case-insensitive manner.)

JPEG 2000 representation information is formatted by the
output handlers
consistent with the NISO image metadata
[NISO Z39.87].
Additional representation information includes the image technical properties
of all boxes.

Only digital objects consisting entirely of properly UTF-8-encoded text
[Unicode]
are well-formed and valid with respect to the
UTF8-hul module.

The module is invoked by the following command line option:

jhove ... -m utf8-hul ...

(Recall that JHOVE module names can be specified in a case-insensitive manner.)

In addition to the standard representation information, the UTF-8 module
includes the number of characters, the Unicode character blocks
[Unicode blocks],
and the line endings used in the digital object.
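
The additional UTF-8 properties can be approximated with the standard Java library. The sketch below is an assumption about how such figures might be gathered, not the module's actual code: it counts every Unicode code point (including line-ending characters) as a character, and it expects input that has already been found to be well-formed UTF-8. The class and field names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of the JHOVE API.
final class Utf8StatsSketch {

    final int characters;                                   // code point count
    final Set<Character.UnicodeBlock> blocks = new LinkedHashSet<>();
    final Set<String> lineEndings = new LinkedHashSet<>();  // "CR", "LF", "CRLF"

    Utf8StatsSketch(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        characters = text.codePointCount(0, text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            if (block != null) {
                blocks.add(block);
            }
            if (cp == '\r') {
                boolean crlf = i + 1 < text.length() && text.charAt(i + 1) == '\n';
                lineEndings.add(crlf ? "CRLF" : "CR");
                if (crlf) {
                    i += 2;                 // consume the LF of the CRLF pair
                    continue;
                }
            } else if (cp == '\n') {
                lineEndings.add("LF");
            }
            i += Character.charCount(cp);   // 2 for supplementary characters
        }
    }
}
```

For the text "a", CR, LF, "b", LF, U+00E9, the sketch reports 6 characters, the blocks BASIC_LATIN and LATIN_1_SUPPLEMENT, and the line endings CRLF and LF.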

WAVE representation information is formatted by the
output handlers
consistent with the proposed AES-X098B, Core audio metadata XML
definition, currently under development by the
Audio Engineering Society (AES)
SC-03-06
Working Group on Digital Library and Archive Systems.
Additional representation information includes the audio technical properties
of all chunks.

The XML-hul module can use any XML parser that conforms to the
SAX2 interfaces.
The actual parser used is the first valid value defined as:

The parser specified by the -x sax-class command line
option (whose class file must be found on the CLASSPATH at the
time of execution);

The value of the edu.harvard.hul.ois.jhove.saxClass property
in the properties file
${user.home}/jhove/jhove.properties,
where ${user.home}
is the standard Java user.home property; or

(Recall that JHOVE output handlers can be specified in a case-insensitive
manner.)

The XML handler formats raster still image representation information
according to the MIX schema [MIX]
for the NISO image metadata [NISO Z39.87].

Note: Contrary to the NISO image metadata data dictionary,
JHOVE defines XSamplingFrequency and
YSamplingFrequency as rational values, not positive integers.
This is necessary for images whose image length or width
is not an integral ratio of the image source X or Y dimension.
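
A short worked example shows why a rational value is needed. An image 1000 pixels wide captured from a source 3 inches wide has a sampling frequency of 1000/3 pixels per inch, which no integer represents exactly. The class below is a hypothetical illustration of a reduced fraction, not JHOVE's own rational type.

```java
// Hypothetical helper, not part of the JHOVE API.
final class RationalSketch {

    final long numerator;
    final long denominator;

    /** Builds the fraction n/d in lowest terms; d must be positive. */
    RationalSketch(long n, long d) {
        if (d <= 0) {
            throw new IllegalArgumentException("denominator must be positive");
        }
        long g = gcd(Math.abs(n), d);
        numerator = n / g;
        denominator = d / g;
    }

    /** Greatest common divisor by Euclid's algorithm. */
    private static long gcd(long a, long b) {
        return b == 0 ? a : gcd(b, a % b);
    }

    double toDouble() {
        return (double) numerator / denominator;
    }

    @Override
    public String toString() {
        return numerator + "/" + denominator;
    }
}
```

new RationalSketch(1000, 3) prints as "1000/3", while new RationalSketch(1200, 4) reduces to "300/1", the case in which an integer would have sufficed.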

Audio representation information is formatted according to
the proposed AES-X098B, Core audio metadata XML
definition, currently under development by the
Audio Engineering Society (AES)
SC-03-06
Working Group on Digital Library and Archive Systems.

8 Logging support

As an aid to debugging third-party modifications, JHOVE supports
the Java logging API. As delivered, each instance of
JhoveBase creates a logger named
"edu.harvard.hul.ois.jhove", and any module which invokes the
ModuleBase constructor creates a logger named
"edu.harvard.hul.ois.jhove.module". The logging level
can be set either with the logLevel element of the configuration
file or with the -l parameter in the command line. Permissible
logging levels are OFF, SEVERE, WARNING, INFO, CONFIG, FINE, FINER,
FINEST, and ALL. The default logging level is SEVERE.
See the Sun
logging overview for more information on logging.
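
The following sketch shows how the two logger names interact under the standard java.util.logging API. The class and method names are hypothetical, and it is an assumption of this example (not a statement about JHOVE's internals) that the level is set only on the parent logger; the child then inherits it through the dotted-name hierarchy.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of the JHOVE API.
final class LoggingSketch {

    // Strong references keep the loggers from being garbage collected.
    static final Logger BASE = Logger.getLogger("edu.harvard.hul.ois.jhove");
    static final Logger MODULE = Logger.getLogger("edu.harvard.hul.ois.jhove.module");

    /** Applies a level name such as "SEVERE" or "FINE" to the base logger. */
    static void setLevel(String levelName) {
        // Level.parse throws IllegalArgumentException for an unknown name.
        BASE.setLevel(Level.parse(levelName));
    }
}
```

After LoggingSketch.setLevel("SEVERE"), a WARNING message sent to the module logger is discarded; after setLevel("FINE"), it is passed on to the configured log handlers.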