[NB: Making MetaSausage is a new blog on legislative metadata and legislative systems. It’s a place to talk geek about legislation. We make no promises, but we think posts will appear every couple of weeks. Comments encouraged.]

The law-creating process described in How Our Laws Are Made (HOLAM), and other civics texts like it, is a lot like the Mississippi River: formed out of a zillion small tributaries, many of them nameless, joined into a stream that passes through a number of jurisdictions and has lots of side passages, loops and eddies, eventually breaking up again into a series of tiny streams passing through a delta. There is a central part of the process — the mainstream — that is fairly well mapped, with placenames and milestones that are pretty well understood. There are hundreds of smaller streams and brooks at either end of the process that are not well understood or named at all, and a few places in the middle where the main stream branches unpredictably. It is a complicated map, and it describes a territory where many people, places and things are named — but many are not, and some are named in ways that are ambiguous, confusing, or conflicting.

This post is about identifiers, and particularly document identifiers: snippets of text that uniquely identify documents that are either generated by the legislative process or are found in its vicinity. That idea is simple enough. But well-thought-out, carefully constructed identifiers are an important foundation of any data model, and they are surprisingly difficult to design. Legislative data models have (at least) two purposes. First, they are a kind of specification that precisely describes data encountered in and around the legislative process, the relationships among the data items and elements, and (significantly) relationships between the data and the real-world people, groups, and processes that create and manipulate the data. Second, they are a device to enable communication among system-builders, stakeholders, and users about what is to be collected, what is to be expressed or retrieved, and so on. Before any of that can be built in a way that is both precise and communicative, we must be sure of what exactly we are talking about. Identifiers should answer that question (what the hell are we talking about?) unambiguously. Or at least we would like them to. Often, our legacy identifier systems don’t do that very well. As we shall see, many existing identifier schemes are burdened with competing constraints and conflicting expectations, with less-than-ideal results.

What do identifiers do?

In print, identifiers have worked differently than we really want them to in an electronic environment. The conventions of printed books — use of pagination, difficulty of recall once issued, relative stability of editions, and most of all the assumption that identifiers will be interpreted by human readers with some knowledge of their context and purpose — result in identifiers that are less rigorous than what we need in a world of granular data consumed and processed by machines. Some illustrations are found below. In reality our legacy “identifiers” are often less-rigorous monikers serving multiple functions, and in a digital environment we must unpack them into separate items with separate functions. Here are some of the functions:

a) Unique naming. The diverse monikers that document creators and administrators use in current practice are supposed to provide unique names for documents. Sometimes they do; often they don’t. Usually that is because a moniker that is unique within a particular scope loses uniqueness in some wider, unanticipated arena. That is especially likely to happen when a collection of objects is moved from its original, intended scope onto the open Web, but you can find examples closer to home. A Congressional bill number is a good example: it is unique only within the Congress during which it was assigned. There might be an “H.R. 1234” in each of several Congresses; “108 H.R. 1234” is made unique by the addition of the number of the Congress during which it was introduced. Of course, human error is often at fault, as when (for one year in the mid-1990s) there were two very different section 512s in Title 17 of the US Code.
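The scoping problem with bill numbers can be sketched in a few lines of Python. This is only an illustration, not a description of any existing system: the `BillId` class, its field names, and the second bill (from the 109th Congress) are assumptions invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BillId:
    """Hypothetical scoped identifier: a bill number qualified by its Congress."""
    congress: int  # e.g. 108
    prefix: str    # e.g. "H.R." (illustrative; other measure types exist)
    number: int

    def local_name(self) -> str:
        # The moniker as it is used inside a single Congress
        return f"{self.prefix} {self.number}"

    def qualified_name(self) -> str:
        # The same moniker made unique by adding the Congress number
        return f"{self.congress} {self.local_name()}"


# Two different bills that share a number (the 109th Congress bill is invented)
bills = [BillId(108, "H.R.", 1234), BillId(109, "H.R.", 1234)]

# Collecting names into sets shows where uniqueness is lost
local_names = {b.local_name() for b in bills}
qualified_names = {b.qualified_name() for b in bills}
```

Here `local_names` collapses to a single entry while `qualified_names` keeps the two bills apart, which is the whole argument for carrying the scope inside the identifier.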

b) Navigational reference. Identifiers often serve as search terms or convenient handles for taking the reader to another document, or for retrieving it (we discuss retrieval in the next section). Standard caselaw citation practice is a special case of this, created specifically for printed books. In that legacy context, unique identification and citation functions are often run together badly, usually because numbered pages are not sufficiently granular to uniquely identify individual items. For example, two briefly-reported judicial opinions might well appear on the same page of a print reporter, and thus carry an identical citation. The citation is then a perfectly good tool for navigating to each case within a series of printed volumes, but is not a unique name or identifier for either of them. A look at http://bulk.resource.org/courts.gov/c/F3/173/ will show that numerous cases, each quite short, originally appeared on page 421 of Volume 173 of West’s Federal Reporter, 3rd Series. A sample is here: http://liicr.nl/rimZJe . Any of the cases listed might be cited as 173 F.3d 421.
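A rough way to see the difference between a navigational citation and a unique name is to index some opinions by citation and look for collisions. The sketch below assumes a made-up record layout; the ids, captions, and the second citation are placeholders and do not correspond to the real cases on that page.

```python
from collections import defaultdict

# Placeholder records: ids and captions are invented for illustration,
# and "173 F.3d 430" is a made-up citation used only for contrast.
opinions = [
    {"id": "case-a", "caption": "Placeholder One", "cite": "173 F.3d 421"},
    {"id": "case-b", "caption": "Placeholder Two", "cite": "173 F.3d 421"},
    {"id": "case-c", "caption": "Placeholder Three", "cite": "173 F.3d 430"},
]

# Index opinions by print citation; one citation may map to many opinions
by_cite = defaultdict(list)
for op in opinions:
    by_cite[op["cite"]].append(op["id"])


def resolve(cite):
    """Return every opinion id that a print citation navigates to."""
    return by_cite.get(cite, [])


# Citations that navigate well enough but fail as unique identifiers
ambiguous = {cite: ids for cite, ids in by_cite.items() if len(ids) > 1}
```

A system that needs unique names would have to mint something like the `id` field itself, and treat the citation as one of possibly several navigational aliases for it.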

c) Retrieval hook/container label. Here, we distinguish use of a citation as a retrieval hook from its use as a navigational device. As we make our way around the Web, that distinction is usually blurred. Following a link to its destination puts a chunk of text in front of our eyes, and so it’s hard to remember that the link might refer to the contents of a container for which it also provides a label, rather than to a simple destination milestone.

To make the distinction clear, it’s useful to think about incorporation-by-reference or other forms of embedding. Suppose that we wish to present the current text of a subsection of a statute inside some other online document (a citizen’s guide to Social Security benefits, for example). We would likely do that via machine retrieval of the particular statutory subsection based on its identifier, but our goal would be to summon up a chunk of text, not navigate to a particular destination. Put another way, our current practice conflates the use of citation as a means of identifying a point, milestone, or destination in a document (a navigational reference) with its use as a means of identifying a labelled subdocument that can be referenced or retrieved for other purposes (a container label).
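The two uses can be separated in code. The following is a minimal sketch under stated assumptions: the identifier scheme, the statute text, the `{{...}}` placeholder syntax, and the `example.org` base URL are all hypothetical and exist only to show one identifier being used in two different ways.

```python
# A toy statute store. Identifiers and text are invented for illustration.
STATUTE = {
    "example-act/s1/a": "Text of subsection (a).",
    "example-act/s1/b": "Text of subsection (b).",
}

BASE_URL = "https://example.org/statutes/"  # hypothetical site


def navigate(identifier: str) -> str:
    """Navigational use: return a destination for the reader to visit."""
    path, _, sub = identifier.rpartition("/")
    return f"{BASE_URL}{path}#{sub}"


def retrieve(identifier: str) -> str:
    """Container use: return the labelled chunk of text itself."""
    return STATUTE[identifier]


def render_guide(template: str) -> str:
    """Embed current statutory text wherever the guide names a container."""
    out = template
    for identifier in STATUTE:
        out = out.replace("{{" + identifier + "}}", retrieve(identifier))
    return out


guide = render_guide("The rule says: {{example-act/s1/b}}")
```

`navigate` hands back a place to go, while `retrieve` hands back the contents of a container; the citizen’s guide needs the second, even though on the Web both usually hide behind the same link.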

As an example, the THOMAS pages for individual bills and resolutions aggregate a great deal of information from the Congressional Record (CR). The Bill Summary ‘Actions’ link to two representations of the relevant CR page. One is a textual representation that begins with the desired text but sometimes extends past it into information about unrelated issues. The other is a PDF that shows the whole page, on which the desired text may start towards the end, followed by subsequent pages if the relevant section extends past the initial page.

For a specific example of this, the Lilly Ledbetter Fair Pay Act of 2009 has a list of major actions on THOMAS, one of which is a “motion to proceed to consideration of measure withdrawn in Senate” on Jan. 13, 2009. The link for information on that motion is to CR S349: a specific page of the Congressional Record. Invoking that link leads to a THOMAS page for CR S349.

That THOMAS page lists the four items on the particular Congressional Record page, the last of which is the item sought. Invoking that item retrieves a default page with the specific text of the motion. A link at the head of the text leads to the PDF version of the page, where the Lilly Ledbetter motion appears at the bottom.

d) Thread tag/associative marker. Some monikers group related documents into threads — aggregations whose internal arrangement is implicitly chronological. An insurance company claim number is, in exactly this way, a dual-purpose tool. On the one hand, it refers uniquely to a document (a claim form) that you submit after your fender-bender. On the other, the insurance company tells you that you must “use this claim number in all correspondence” — that is, use it to prospectively tag related documents. That creates a labelled group of documents. If we then sort the group chronologically, it becomes a kind of narrative thread.

In this way, the moniker implies a relationship between the documents without explicitly naming or describing it, as well as being pressed into service as the identifier for one or more documents in the cluster. Regulatory docket numbers function in this manner. That is intentional, because dockets are meant to be gathering places for documents. What is confusing (and important to remember) is that a moniker that uniquely identifies a process, a regulatory rulemaking, has been bent to identify a collection of items associated with that process, and neither the association, the collection of items, nor any particular document has been uniquely identified.
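The thread-tag function is easy to mimic: group documents by the shared moniker, then sort each group by date. The sketch below uses invented docket numbers, titles, and dates, and the record layout is an assumption rather than any agency’s actual schema.

```python
from datetime import date
from itertools import groupby

# Invented docket numbers, titles, and dates, for illustration only.
documents = [
    {"docket": "DOCKET-0002", "title": "Notice of proposed rule", "filed": date(2011, 3, 1)},
    {"docket": "DOCKET-0001", "title": "Final rule", "filed": date(2011, 6, 15)},
    {"docket": "DOCKET-0001", "title": "Public comment", "filed": date(2011, 2, 10)},
    {"docket": "DOCKET-0001", "title": "Notice of proposed rule", "filed": date(2011, 1, 5)},
]


def threads(docs):
    """Group documents by docket number, each group in chronological order."""
    # groupby only merges adjacent items, so sort by docket (then date) first
    ordered = sorted(docs, key=lambda d: (d["docket"], d["filed"]))
    return {
        docket: [d["title"] for d in group]
        for docket, group in groupby(ordered, key=lambda d: d["docket"])
    }


by_docket = threads(documents)
```

Notice what the docket number does not do: two documents in different dockets carry the same title, and nothing in the data uniquely names any single document. The moniker labels the thread and nothing else.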

Another conceptually-related but distinct example of this is the use of “captive search” URIs to meet a user’s need to dynamically assemble a set of related documents. For instance, one can retrieve all the environmental law decisions of the Supreme Court at this link:

Such URIs embed search terms (“environment”, “environmental”, “EPA”) and, when used in links, retrieve the set of documents found by searching on those terms. Typically, they are used to deal with instability or growth in the underlying corpus of things being searched. They are “automatically” kept up to date as the collection changes, inasmuch as they just provoke a search of the changed collection that presents results based on the current collection contents.

In that way, they are a great help to site designers. Problems can arise, however, if the user imagines that the URI somehow identifies the exact set of items retrieved for any time period other than the moment of retrieval. Precisely because the method is dynamic, the user may or may not retrieve the same document set at a later invocation. As a low-cost, low-effort alternative to semantic tagging, however, the approach is irresistible.
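The dynamic behavior of a captive search URI can be imitated with a toy corpus. Everything here is an assumption made for illustration: the `example.org` endpoint, the `collection` and `q` parameter names, the substring-matching "search", and the document titles are invented, and no real search service works exactly this way.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SEARCH_ENDPOINT = "https://example.org/search"  # hypothetical endpoint


def captive_search_uri(terms, collection):
    """Build a link that reruns a search each time it is followed."""
    query = urlencode({"collection": collection, "q": " OR ".join(terms)})
    return f"{SEARCH_ENDPOINT}?{query}"


def run_search(uri, corpus):
    """Toy search: return titles in `corpus` containing any term in the URI."""
    q = parse_qs(urlsplit(uri).query)["q"][0]
    terms = [t.lower() for t in q.split(" OR ")]
    return [title for title in corpus if any(t in title.lower() for t in terms)]


uri = captive_search_uri(["environment", "environmental", "EPA"], "supct")

# Invented titles; the second corpus is the first after the collection grew
corpus_then = ["Placeholder v. EPA", "A contracts case"]
corpus_now = corpus_then + ["An environmental permit case"]

results_then = run_search(uri, corpus_then)
results_now = run_search(uri, corpus_now)
```

The URI string never changes, yet `results_then` and `results_now` differ. That is the convenience and the trap: the URI names a query, not a fixed set of documents.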

Some newer systems, such as VIAF, do allow the ad-hoc construction of URIs for dynamically assembled sets of objects that are then fixed as a permanent group identified by the newly-minted URI. Assuming that an appropriate search could be designed, one might thus construct URIs for any useful group of items found in an authority file, for example a list of all subcommittees of the House Armed Services Committee that have existed up to the present:

e) Process milestone. The grant of a moniker by an official body can be an acknowledgement that official notice must now be taken, or that some process has begun, ended, or reached some other important stage. That is obviously the case with bills, where a single piece of legislation may receive a number of identifiers as it makes its way through the process, culminating in a Public Law number at the time of signing. The existence of such a PL number can be taken as evidence that the bill has been passed into law.

f) Proxy for provenance. Again because monikers are often assigned by officials or organizations with special standing, they become proxies for provenance. The existence of a bill number is evidence that the Clerk of the House has seen something and acted in a particular way with respect to it; it is valuable evidence in any attempt to establish authority.

g) Popular names, professional terms of art, and other vernacular uses. Monikers notably find their way into popular and professional use, some in ways that are quite persistent. News media frequently refer to legislation by a popular name created by Congress based on the names of sponsors (the “Taft-Hartley Act”) or by the press itself (“Obamacare”). They can be politicized (“death tax”), or serve as a kind of marketing tool (“USA-PATRIOT Act”). Some labels and identifiers become very closely associated with the things they label, becoming terms of art in their own right. Thus, it is common to refer to a “501(c)(3) nonprofit” or a “Subchapter K” partnership. Vernacular labels have particular importance for citizens, who often use them as input to search systems. At this writing, developers at the Sunlight Foundation have just started an initiative to collect such labels through crowdsourcing.

Bruce, T. R., and Richards, R. C. (2011). Adapting Specialized Legal Metadata to the Digital Environment: The Code of Federal Regulations Parallel Table of Authorities and Rules. Paper presented at ICAIL 2011: The 13th International Conference on Artificial Intelligence and Law. Slides at http://liicr.nl/qdWBWi . Full text available from the authors.