Analysis of Data Overlap Between the MARC21 Bibliographic Format and the TextMD Data Element Set

INTRODUCTION

The present paper examines the MARC21 Bibliographic format and the TextMD element set for correspondences in the data
which they accommodate. TextMD’s area of concern is textual
digital entities. TextMD is introduced
as an extension schema for METS, which explains
why it has no element that serves as an identifier of the object being
described. That omission makes the statement that the TextMD schema “could
also exist as a standalone document” difficult to understand.

The number of TextMD
elements is relatively small; the exact count depends on whether attributes
are counted and on which version of the schema is examined. This analysis considered version 3.01, but is
equally applicable to version 2.2 because the newly added elements have no
discernible equivalents, or at least no specific homes, in MARC21.

TextMD values are generally expressed
as strings, although one (byte_size)
requires an integer, and the elements added for version 3.01 are described
as “tokens” in the schema, where their enumerated values appear as strings. Roughly three-fourths of the values
designated as strings accept free text; the remainder are restricted to the values in
enumerated lists.

Findings of this investigation are
expressed in a pair of tables that have the same contents arranged
differently. Table MARC21 vs. TextMD is ordered by MARC tag and subfield code; table TextMD vs. MARC21 follows the order of
TextMD data elements as presented in its XML schema. Data elements are not identified by numbers
in the TextMD documentation, but have been assigned them here to facilitate
sorting the tables.
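
The relationship between the two tables can be pictured as one set of rows sorted by two different keys. The following is a minimal sketch in Python, not the procedure actually used to build the tables. The row layout and function names are hypothetical, and the TextMD sort numbers shown are invented for illustration; only the field and element pairings are taken from the notes in this paper.

```python
from typing import NamedTuple


class Correspondence(NamedTuple):
    marc_tag: str        # MARC21 field tag, e.g. "538"
    marc_subfield: str   # subfield code, or "" when the whole field is meant
    textmd_number: int   # illustrative sort number (invented for this sketch)
    textmd_element: str  # TextMD element name


# A few sample rows; the sort numbers are placeholders, not the assigned ones.
ROWS = [
    Correspondence("856", "c", 3, "encoding_software"),
    Correspondence("041", "", 9, "language"),
    Correspondence("500", "", 12, "processingNote"),
    Correspondence("538", "", 10, "printRequirements"),
]


def by_marc(rows):
    """Order rows as in table MARC21 vs. TextMD: by tag, then subfield."""
    return sorted(rows, key=lambda r: (r.marc_tag, r.marc_subfield))


def by_textmd(rows):
    """Order rows as in table TextMD vs. MARC21: by element sort number."""
    return sorted(rows, key=lambda r: (r.textmd_number, r.marc_tag, r.marc_subfield))
```

Sorting MARC tags as strings works here because tags are fixed-width three-character values; the numeric key on the TextMD side is what makes schema order reproducible.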

NOTES ON THE ANALYSIS

Names of TextMD elements and
attributes are italicized in these notes.

With the exception of fields and
elements referring to the language of the text object, correspondences between
MARC21 and TextMD are fuzzy. The tables
have been built on the assumption that it is preferable to identify possible
correspondences that may be tenuous, or even erroneous, rather than to omit an
important connection inadvertently. Users
of these tables who know more about their own or others’ TextMD implementations
may be able to eliminate some table entries as universally inappropriate or too
rare to be worthy of consideration.

The MARC21 fields (041, 546) and TextMD
elements (language, alt_language) concerning
the language(s) of the text object are all repeatable, making it fairly easy to
translate the information from one scheme to the other, having only to convert
between MARC codes and ISO 639-2 in most cases, and to supply the necessary authority
information in either scheme. When 041 and 546 are absent, the non-repeatable
MARC21 element 008/35-37 can be used similarly for simple cases.
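
The language fallback described above can be sketched as follows. This is an illustrative Python outline under stated assumptions, not a tested converter: the function name and its arguments are hypothetical, the record is assumed to have been parsed already into a list of 041 language codes and the raw 008 string, and the `overrides` table (empty by default) stands in for whatever MARC-to-ISO 639-2 differences an implementer needs to handle. How the resulting codes are divided between language and alt_language is left to the implementer.

```python
def iso_language_codes(codes_041, field_008=None, overrides=None):
    """Return language codes for a text object, in ISO 639-2 form.

    codes_041 -- language codes already extracted from 041 subfields
    field_008 -- the raw 008 string, used only when 041 yields nothing
    overrides -- hypothetical mapping of MARC codes to ISO 639-2 codes
                 for the cases where the two lists differ
    """
    overrides = overrides or {}
    codes = [c.strip().lower() for c in codes_041 if c.strip()]

    # Fall back to 008/35-37 (zero-based slice 35:38) for simple cases.
    if not codes and field_008 is not None and len(field_008) >= 38:
        fallback = field_008[35:38].strip().lower()
        # Blanks mean no information; fill characters mean no attempt to code.
        if fallback and fallback != "|||":
            codes = [fallback]

    result = []
    for code in codes:
        iso = overrides.get(code, code)
        if iso not in result:  # keep first occurrence, preserve order
            result.append(iso)
    return result
```

For example, `iso_language_codes([], "0" * 35 + "ger" + "  ")` falls back to the 008 value, while a record with 041 codes never consults 008 at all.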

Certain positions in the MARC 007 field of a
computer file record may be of limited use in formulating a TextMD QUALITY attribute, especially if that
attribute is not expected to be any more precise than the single word “good” as
shown in the example provided in TextMD documentation.

MARC has a field (514) called Data
Quality Note whose name suggests relevance to the TextMD QUALITY attribute, but the definitions of
514 subfields all point toward limiting the use of the field to cartographic
resources, so the field is unhelpful for text objects.

The tables show MARC21 field 538
(System Details Note) as relating to TextMD containers (encoding, character_info) and distinct elements (printRequirements, viewingRequirements)
on the basis of examples given in MARC documentation, but it seems entirely
possible that information concerning other elements (e.g. markup_basis, markup_language) might also be encountered in 538. Documentation of 538 does not distinguish
between the computer requirements of the system that created the digital object
(corresponding to TextMD encoding_platform)
and those concerned with viewing or printing the object.

MARC21 field 856 (Electronic
Location and Access), true to its title, is apt to offer little information
that will fit in the TextMD encoding or character_info containers, possible
exceptions being $c (Compression information) as part of encoding_software, and $r (Settings) for determining byte_size, although the $r expression of
the latter has begun to look quaint.

On the other hand, 856 may contain,
in various subfields, data appropriate to TextMD printRequirements and/or viewingRequirements inasmuch as those elements are specified to be free text strings.

The tables cite field 500 as the
possible MARC21 home for data that would appear in the catch-all TextMD
elements processingNote and textNote, but, depending on the content
of individual instances of those elements, somewhat more specific 5xx notes may
occasionally be more appropriate.

Data in new TextMD version 3.01
elements (e.g. representationSequence,
lineLayout, lineOrientation, characterFlow) might also be reflected in
somewhat different form in MARC21 note fields such as 500 and 546, particularly
when the text of the object is written in non-Latin scripts.

The process of deciding what kinds
of data are expected in any MARC21 subfield of interest, particularly in free
text note fields, has been guided by the examples provided in the full online
MARC21 Bibliographic Format document. No
attempt was made to search for additional examples in library databases.

The investigator is aware of his
tendency to think of metadata comparisons of this sort in terms of conversions
from one scheme to another. That is a
matter more limited than the general question of data overlap. Nevertheless, that mindset may have crept
occasionally into the language used to talk about the relationships of certain
data elements between MARC21 and TextMD. This should not seriously affect interpretation of the findings.

In carrying out this task, the investigator noted that re-examining some features may serve the scheme well in the long run. Here are some things to consider.

Unify element naming
conventions. The use of camelCase in
some element names, words separated by underscores in others, and all-caps for
one attribute name is prone to cause error.

Eliminate use of the same name for
elements that are functionally different. The name linebreak seems to be
applied to the same datum in two ways, once as an attribute of an element in the
encoding container, once as an element in a different container. The name encoding is employed for a container and also for an attribute of an element in a
different container. (The latter usage
is further complicated by its relation to the charset element. That is not a fault of TextMD; rather it stems
from the IANA character set name list’s inclusion of names for both character repertoires
and encodings.)
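
Both considerations above lend themselves to a simple mechanical check. The sketch below, in Python, is offered only as an illustration: the function names are hypothetical, the inventory is a small sample of names and roles as they are described in these notes rather than a complete reading of the schema, and the style rules are one reasonable guess at how to classify a name.

```python
import re
from collections import defaultdict

# (name, role) pairs as described in these notes; a sample, not the full schema.
INVENTORY = [
    ("encoding", "container"),
    ("encoding", "attribute"),
    ("character_info", "container"),
    ("linebreak", "attribute"),
    ("linebreak", "element"),
    ("QUALITY", "attribute"),
    ("byte_size", "element"),
    ("markup_basis", "element"),
    ("charset", "element"),
    ("printRequirements", "element"),
    ("processingNote", "element"),
]


def naming_style(name):
    """Classify a name by the convention it appears to follow."""
    if re.fullmatch(r"[A-Z][A-Z0-9]*", name):
        return "ALLCAPS"
    if "_" in name:
        return "snake_case"
    if re.search(r"[a-z][A-Z]", name):
        return "camelCase"
    return "lowercase"


def styles_in_use(inventory):
    """Map each naming style to the sorted names that use it."""
    groups = defaultdict(set)
    for name, _role in inventory:
        groups[naming_style(name)].add(name)
    return {style: sorted(names) for style, names in groups.items()}


def reused_names(inventory):
    """Map each name used in more than one role to its sorted roles."""
    roles = defaultdict(set)
    for name, role in inventory:
        roles[name].add(role)
    return {n: sorted(r) for n, r in roles.items() if len(r) > 1}
```

Run against the sample, `styles_in_use` reports several conventions side by side, and `reused_names` flags linebreak and encoding, the two names singled out above.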

Prepared for the Library of Congress by Charles W. Husbands, 21 January 2010