Identifiers:
Unique, Persistent, Global

The Importance of Identifiers Today

The human sensory system is a marvelous thing. I
can recognize a face (even though I will often forget the name that
goes with it). I can tell a rose from an iris by its smell as well as
its look. I can name a bird that I cannot see by its distinctive
call. I can search the library catalog for the string "moby
dick" and wade through a retrieved set of different editions of
the book and of films based on the book and can select an individual
entry that meets my needs. With our senses and our brain we can
identify items in the world around us with an impressive amount of
accuracy.

Computers are often called "electronic
brains" and are said to "think." These disembodied
mechanical brains lack sensory input, however, and therefore they
don't have our ability to tell a daisy from a dandelion, or Moby
Dick from Dick and Jane. For a computer to act on any data
about a thing, it needs to be given an identifier that represents the
thing in its computational model. As our machine-to-machine activity
has increased over the decades since computers became commonplace,
our need for identifiers has increased as well. Identifiers that
served us well in the early days of business transactions, like the
ISBN, are showing their age. We not only need to apply these
identifiers to a larger number of items than was originally intended,
we also want to be able to refer to individual parts of those items,
like chapters or pages. Suddenly it seems that there are not enough
numbers in the world to identify everything that we need to identify.

In March of 2006, the National Information
Standards Organization (NISO) held a roundtable discussion at the
National Library of Medicine with experts from a variety of areas
where identifiers are created, used, and maintained. The purpose of
this meeting was to articulate needs that might be met through
standards or other activities that could be led by NISO.1
The attendees agreed that we need more information in our community
generally about the technology of identifiers and their use in
systems and services. Experts in the room talked of common
misconceptions about identifiers and clarified definitions, as well
as proposed solutions. I present some of those in this article.

What is an Identifier?

Identifiers, as we use them in our electronic
systems, are strings of numbers, letters, and symbols that represent
some thing. Identifiers are said to "name" things
and "[n]aming entities makes it possible to refer to them, which
is essential for any kind of processing."2
But how does a string like "039450643X" come to name, in
this case, a particular edition of Remembrance of Things Past,
by Marcel Proust? The book was originally published at a time when no
such identifier was used. It is the assignment of that string
to that thing, the book, by the publisher that makes it an
identifier. Some would go so far as to say that the identifier is not
the string but "an association between a string (a sequence of
characters) and an information resource."3
In either case, we recognize that an identifier is a convention that
requires some person or organization to assert the relationship
between the string and the thing. The
string 039450643X is meaningless without the ISBN standard that
defines it, and without the action of the publisher that assigns that
string to a book. In this sense, identifiers are social agreements,
and their value depends on the dedication of the organization that
creates, maintains, and assigns them.
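
To make the distinction between a well-formed string and an assigned identifier concrete, here is a minimal sketch in Python of the ISBN-10 check-digit test (a weighted sum that must be divisible by 11). The function name is ours, chosen for illustration. Passing the test shows only that a string conforms to the ISBN convention; it says nothing about whether a publisher ever assigned that string to a book.

```python
def isbn10_is_valid(isbn):
    """Return True if the string passes the ISBN-10 check-digit test."""
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            value = 10  # "X" stands for 10 and is allowed only in the last position
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        total += (10 - position) * value  # weights run 10, 9, ..., 1
    return total % 11 == 0


print(isbn10_is_valid("039450643X"))  # the string discussed above
print(isbn10_is_valid("0394506438"))  # same string with a different final character
```

The first call reports a well-formed ISBN and the second does not, yet neither result tells us which book, if any, the string names. That part of the identifier lives in the publisher's assignment, not in the arithmetic.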

Qualities of Identifiers

You often see the word "identifier"
preceded by an adjective, like "unique" or "persistent."
These are qualities that we often require of identifiers because they
are thought to be necessary for identifier use. These qualities are
indeed important, but exactly what they mean and how they help us
maintain reliable systems and interoperate electronically is more
nuanced than you might imagine.

Unique Identifier

Uniqueness is one of the basic requirements that
is cited for identifiers. But what does it mean to say that an
identifier is unique? There are at least two ways to define "unique"
for identifiers:

each thing has one and only one identifier

each identifier refers to one and only one
thing

In reality, uniqueness is relative to the task at
hand. In terms of having one and only one identifier, think about the
many identifiers that are associated with you: your name, your Social
Security Number (SSN), your credit card numbers, your driver's
license number, and others. Each of these is you, but each can serve
a different function. When you file your taxes with the IRS you are
identified by your SSN, but when you make a purchase with a credit
card you are identified by that credit card number. We are complex
creatures who exist and interact in a variety of contexts. Each of
those contexts can, and often does, use its own identifier for us. So
we can amend our statement by saying that each thing has one and only
one identifier within a defined context.

In terms of an identifier referring to one and
only one thing, it all depends on what "one" means, and it
means different things in different contexts. Publishers consider the
ISBN to be an item-level identifier, but their item is available in
many copies. Libraries consider the ISBN to be at the level of a
title (in the pre-FRBR sense of that word), and assign barcodes to
identify each physical item on their shelves. The uniqueness in this
case relates to the granularity of the need. My twelve apples are
your one dozen. Both are correct, but they would require different
identifiers.
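
The two senses of "unique," each scoped to a context, can be sketched as a small registry. This is an illustration under our own assumptions, not a description of any real system: the class name, the context labels, and the identifier values are all hypothetical.

```python
class ContextRegistry:
    """Hypothetical registry enforcing both senses of "unique" within one context."""

    def __init__(self, context):
        self.context = context
        self._thing_by_id = {}
        self._id_by_thing = {}

    def assign(self, identifier, thing):
        # Sense 2: each identifier refers to one and only one thing.
        if identifier in self._thing_by_id:
            raise ValueError(f"{identifier!r} already names something in {self.context}")
        # Sense 1: each thing has one and only one identifier (in this context).
        if thing in self._id_by_thing:
            raise ValueError(f"{thing!r} already has an identifier in {self.context}")
        self._thing_by_id[identifier] = thing
        self._id_by_thing[thing] = identifier

    def lookup(self, identifier):
        return self._thing_by_id[identifier]


# The same person carries a different identifier in each context.
tax_office = ContextRegistry("tax office")
motor_vehicles = ContextRegistry("motor vehicle department")
tax_office.assign("TAX-0001", "person A")
motor_vehicles.assign("DL-9999", "person A")

print(tax_office.lookup("TAX-0001"))
print(motor_vehicles.lookup("DL-9999"))
```

Neither registry knows about the other, so "person A" having two identifiers is not a conflict. A conflict arises only when a second assignment is attempted inside the same context.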

Persistent Identifier

Persistence is often cited as a primary quality of
an identifier. In simple terms, the answer to the question: "How
long does an identifier need to last?" is: "As long as it
is needed." The identifier for an IP packet going across the
Internet has to last as long as it takes to reach its destination on
the network, which may be a matter of thousandths of a second. For
another kind of package, that carried by the delivery company UPS,
the identifier needs to last until the package is delivered and
billed. A Social Security Number persists for the life of the
individual to whom it was assigned. If we do develop a registry of
authors for the purposes of tracking copyrights, that identifier will
need to persist for at least seventy years after the death of the
author, and preferably for as long as the author's works exist in
some form.

Persistence of identifiers is a particular issue
for libraries and other cultural heritage institutions because we
have no end date on our commitment to the resources we manage. As the
above examples show, in other contexts the identifier has a term of
usefulness after which it can be retired. Persistence, however, is
not a characteristic of the identifier technology: "...
persistence is a function of organizations, not technology."4
Because identifiers are an assertion of a relationship between a
string and a thing, they persist as long as the assertion is
maintained. This is the difference between your average URL, which
may have no expectation of persistence behind it, and the persistent
URL (PURL) service at OCLC that has an organizational commitment to
its longevity. The Archival Resource Key (ARK) developed at the
University of California includes a commitment statement from the
service managing the identifiers as part of its design.5
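
One way to picture persistence as a maintained assertion is a resolver table in which the identifier stays fixed while the location it points to is updated by the responsible organization. The sketch below is a toy illustration only; it is not how PURL or ARK is implemented, and the class name, identifier, and example.org addresses are all invented.

```python
class ToyResolver:
    """Hypothetical resolver table; not the PURL or ARK implementation."""

    def __init__(self):
        self._location_by_id = {}

    def register(self, persistent_id, location):
        # The assertion "this string names this thing" is made once.
        if persistent_id in self._location_by_id:
            raise ValueError(f"{persistent_id!r} is already assigned")
        self._location_by_id[persistent_id] = location

    def relocate(self, persistent_id, new_location):
        # Maintenance: the resource moved, the identifier did not.
        if persistent_id not in self._location_by_id:
            raise KeyError(persistent_id)
        self._location_by_id[persistent_id] = new_location

    def resolve(self, persistent_id):
        return self._location_by_id[persistent_id]


resolver = ToyResolver()
resolver.register("id/report-1", "https://old.example.org/report.html")
resolver.relocate("id/report-1", "https://new.example.org/docs/report.html")
print(resolver.resolve("id/report-1"))
```

Nothing in the code makes the identifier persistent. It keeps working only for as long as someone calls `relocate` when the resource moves, which is the organizational commitment described above.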

An aspect of persistence that is particularly
relevant to complex cultural resources is how a change in the
resource is reflected in the identifier assignment. This is not
limited to the identification of digital materials; the impact of new
editions for monographs and title changes for serials was an issue
for libraries long before the digital age. Decisions such as these
need to be governed by clear policies on the part of the agency
managing the assignment of the identifiers. Where this aspect of
persistence is not managed, such as on the Web, there is no guarantee
that the same identifier will point to the same resource at different
moments of time. When we cite Internet resources we often feel
obliged to qualify our citation with a date ("accessed Feb. 6,
2005") precisely because there is no guarantee of persistence.

Another area where identifiers may or may not
persist is in the meaning and use of the identifier itself.
Identifiers are used in systems and a society that are in constant
change. We have seen that the Social Security Number, originally
intended to connect a person, her earnings, and the Social Security
Administration, has become a de facto personal identifier for
schools, medical insurers, and banks. The International Standard
Book Number (ISBN) has been famously assigned to teddy bears and
biscotti to support the needs of retail bookstores. Today you can
retrieve your boarding pass at an airport using a credit card number,
even if that credit card was not the one originally used to purchase
the electronic ticket. Each of these is an example of the evolution of
what the identifier represents, and each is also a use outside of the
arena of commitment of the managing agency. The bank does not
guarantee that your credit card will be recognized by the airline's
ticket machine, and the Social Security Administration has no
responsibility over the uses of the SSN beyond its own. Opportunistic
uses of identifiers may facilitate business functions, but
reliability and commitment may be sacrificed in the process.

Global Identifiers

It is often said that a certain class of
identifiers must be "globally unique." That is, they
can be used anywhere in any system and will never overlap with an
identifier assigned by someone else. This is a growing concern that
arises out of the increasing interaction between systems in the
digital and networked world. The common experience is that an
identifier is created within a system or within a context, and at a
later date it needs to be used in another or larger context. At that
point, the identifier may no longer be unique. An example from our
environment is the MARC record identifier within the local ILS. It is
common that every database assigns a unique identifier to each record
stored within it. But if at a later date the library wishes to
participate in a union catalog, these record identifiers could very
easily overlap with those of other libraries.

There are techniques that allow the creation of
globally unique identifiers. One of these is the Universally Unique
Identifier (UUID), a mathematically derived 128-bit number that is
virtually guaranteed to be unique for the next millennium. Although
this is a solution, it is perhaps more rigorous than most of us wish
to undertake. A simpler solution is the one used by the Uniform
Resource Locator (URL): because every Internet site must have a
unique address assigned through the domain name system, the owner of
that address can prepend it to any string, essentially saying: "what
follows is my identifier." The file index.html exists in many
millions of instances throughout the World Wide Web, yet each one is
uniquely identified by the domain name and path that precede it.
Similarly, the bibliographic record identifier from a library system
can be allowed to interact in a larger bibliographic context by
prefixing it with the library's code from the MARC Code List for
Organizations that is managed by the Library of Congress.6
In this case, "global" uniqueness is really global within a
large but not universal context. There is nothing to prevent another
community from creating an identifier that would be the same as one
from a library database, including the organization code, but one
weighs the risk of this occurring against the efficiency and cost of
the solution.
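
Both approaches can be sketched briefly. In the example below, two libraries have assigned the same local record number; prefixing each number with an organization code removes the collision in a union catalog. The organization codes "ORG-A" and "ORG-B" are placeholders rather than real entries in the MARC Code List for Organizations, and the colon used as a separator is our own assumption. The last lines generate a random UUID with Python's standard uuid module for comparison.

```python
import uuid


def qualify(org_code, local_id):
    """Prefix a local record identifier with the code of the organization that owns it."""
    return f"{org_code}:{local_id}"


# Two local systems that each assigned record number 000123 (invented data).
library_a = {"000123": "Moby Dick"}
library_b = {"000123": "Dick and Jane"}

# Unique within each system, but not across them.
overlap = library_a.keys() & library_b.keys()
print(overlap)

# Prefixing with the organization code keeps both records in the union catalog.
union_catalog = {}
for org_code, records in (("ORG-A", library_a), ("ORG-B", library_b)):
    for local_id, title in records.items():
        union_catalog[qualify(org_code, local_id)] = title
print(sorted(union_catalog))

# The alternative: a randomly generated 128-bit UUID that needs no prefix.
random_id = uuid.uuid4()
print(len(random_id.bytes) * 8)
```

The prefixed identifier stays short and still shows who assigned it, but its uniqueness depends on the code list being well managed. The UUID depends on no registry at all, at the cost of a long string that carries no information about its origin.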

Identifiers in the Library Environment

Libraries have a long history of the use of
identifiers. Incredibly, one of the more common identifiers in use in
libraries today, the Library of Congress Control Number (LCCN), was
first used in 1898.7
ISBNs, which we now take for granted, were only first assigned in
1970.8
ISSNs came into use in 1975.9
The ISBN and the ISSN, along with their newer cousins the ISMN
(International Standard Music Number)10
and the ISAN (International Standard Audiovisual Number),11
are all standards agreed on through the International Organization
for Standardization (ISO). Other identifiers, not unlike the LCCN, have
become standard through use rather than through a formal standards
process. The OCLC record number is commonly used to identify
machine-readable bibliographic records that were originally obtained
from the OCLC database. The PubMed record identifier (PMID), which
represents the National Library of Medicine bibliographic record in
the way that the LCCN represents the Library of Congress
bibliographic record, is commonly listed in citations of articles in
the medical field. The PMID often serves as the link
between an article citation and the full text of the article even
though this latter is in a database unrelated to PubMed. The Digital
Object Identifier (DOI) is a system that resolves a standard DOI
string to publisher services related to the digital resource.12
The DOI string has become accepted as an identifier even when
resolution is not desired. Yet none of these identifiers covers the
entire world of intellectual resources, and there is overlap among
them: the LCCN and the ISBN both identify books and their metadata;
the DOI and the PMID both identify a subset of the world of journal
articles. A universal resource identifier simply does not exist.

One particular issue related to the longevity of
libraries and library identifiers is the need to use identifiers from
our past in the current highly networked digital systems. There are
two aspects to this: the first is that we have to specify the name
space of the identifier; the second is that we have to be able to
structure the identifier to meet current standards. The primary
identifier standards for the networked world are the Uniform Resource
Identifier (URI), the Uniform Resource Name (URN), and the Uniform
Resource Locator (URL). These are defined as:

URI - The basic identifier method on
the web. A URI is unique in the context of the web.13

URN - An identifier of the URI type but that
is not limited to the location of the resource.14

URL - An identifier that uses the network
location of the resource as its identification.15

Each of these has a mechanism to assign name
spaces to identifiers, ensuring that the identifiers created are
unique. The uniqueness of the URL is guaranteed by the domain name
registration process, since the first part of the URL is the domain
name of the location. This means that all URLs belonging to the
domain owned by the Library of Congress will begin with "…loc.gov"
and URLs belonging to Microsoft Corporation will begin with
"…microsoft.com." The top level of the URN16,
which determines uniqueness for those identifiers, is managed by the
Internet Assigned Numbers Authority (IANA). Of the identifiers in use
in the library community, the ISBN, ISSN and the ISAN are all
registered as URNs. For example, an ISBN expressed as a URN would
look like: urn:ISBN:0-395-36341-1. Unfortunately, those are
the only identifiers used in the library community that have that
registration. A URI begins with a scheme name followed by a colon,
in the form "scheme:…", and the URI schemes registered with IANA
include the familiar "http" and "ftp" as well as over four dozen
others. Included in the URI list are schemes for Z39.50 retrieval
and session.
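
The structure of a URN can be illustrated by splitting one into its parts: the literal prefix "urn", the registered name space (here "ISBN"), and the identifier itself. The following is a minimal sketch with a function name of our own choosing; it checks only the overall shape and does not implement the full URN syntax rules.

```python
def split_urn(urn):
    """Split a URN into (name space, identifier); a shape check only, not full validation."""
    prefix, _, rest = urn.partition(":")
    if prefix.lower() != "urn":
        raise ValueError(f"not a URN: {urn!r}")
    namespace, separator, identifier = rest.partition(":")
    if not namespace or not separator or not identifier:
        raise ValueError(f"URN is missing a name space or identifier: {urn!r}")
    # The name space is case-insensitive, so normalize it for comparison.
    return namespace.lower(), identifier


print(split_urn("urn:ISBN:0-395-36341-1"))
```

Because the name space travels with the string, a receiving system can tell that what follows is an ISBN and not some other ten-character value that happens to look like one.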

Between the URN and the URI we are still lacking
standard network identifiers for most of the identifiers that are
used by library systems. This has recently been rectified through the
introduction of a new URI scheme called "info."17
The "info" URI is specifically designed to allow a wide
range of commonly used identifiers to be defined in the URI format.
This provides a home for all of those identifiers that were developed
either before or outside of the Internet identifier standards. Thus
an LCCN can be expressed as "info:lccn/2002022641" and a
Dewey Decimal Classification number can be
"info:ddc/22/eng//004.678." Before using the "info"
URI format for an identifier, the identifier and its name must be
registered in the "info" URI registry.18
The registry includes some key information, such as the contact for
the agency that maintains the identifier. Already more than a dozen
identifiers have been registered there including the LCCN, the PubMed
identifier, the DOI, and the OCLC record number. The creation of the
"info" URI means that library applications can interact
over the Internet in ways that will be understood by non-library
systems. We now have the capability to use our community's
identifiers wherever a standard URI format is required.
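
As a rough sketch, building an "info" URI amounts to joining a registered name space and the identifier in the pattern shown in the examples above. The set of name spaces below is an illustrative subset that we typed in by hand, not a copy of the actual registry, and the function name is hypothetical. Any normalization or character-escaping rules that an individual name space may define are left out.

```python
# Illustrative subset only; the authoritative list is the "info" URI registry.
REGISTERED_NAMESPACES = {"lccn", "ddc"}


def info_uri(namespace, identifier):
    """Build an "info" URI, refusing name spaces that are not in our local list."""
    if namespace not in REGISTERED_NAMESPACES:
        raise ValueError(f"name space {namespace!r} is not in the local list")
    return f"info:{namespace}/{identifier}"


print(info_uri("lccn", "2002022641"))
print(info_uri("ddc", "22/eng//004.678"))
```

The check against the list mirrors the registration requirement: a name space that nobody has registered has no agency standing behind it, so the sketch refuses to mint a URI for it.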

Conclusion

As our digital systems grow in complexity and in
reach, the number of identifiers also grows. Many of these are
internal to the systems and will not be relevant to information
exchange. But others may surprise us and, like the LCCN, will become
key components of functions that were inconceivable at the time the
identifier was first created. Library identifiers must be able to
conform to the existing network standards, in particular the use of the
URI and the URN identifier formats, when library resources interact
on the Internet. This is yet another way in which we are breaking
down the barriers between libraries and the larger world of
information resources.