2012-05-15

Here's a two-part post, after a very long absence from the blog. I have been busy with several other projects, but I am gearing up to participate in the "legal hacks" event (see http://legalhacks.org) very soon, and as a result, am revisiting some issues related to organizing code-like legal materials.

The first part of this post discusses organizing materials once they have been obtained, and the second part discusses some of the challenges in obtaining them. A bit backwards, perhaps, but the first part is likely more relevant to the upcoming event.

To rehash the old adage that making laws is like making sausage, organizing them after the fact is very much like eating it, and below is my general method for doing that, for whatever it's worth. As always, I welcome any feedback, questions, or suggestions. As framing, Mr. Bruce's overview identifies several roles a legal citation can play:

a) “Unique naming”, i.e., assigning a specific name to a legal provision within a system

b) “Navigational reference”, which is similar to navigating a filesystem

c) “Retrieval hook/container label”, i.e., using a citation as a placeholder to aggregate lower-level content that is stored in other locations/records

d) “Thread tag/associative marker”, i.e., grouping related documents in “threads”; one example he uses is a “captive search” URI, but in my view, this is mainly another way to get at a retrieval hook

e) “Process milestone”, i.e., inferring some meaning from the official status of a document, e.g., if a bill has been assigned a Public Law number, it has presumably been enacted into law

f) “Proxy for provenance”, e.g., the existence of a bill number means that the legislation has been officially noticed in some way

g) “Popular names, professional terms of art, and other vernacular uses”, e.g., the Social Security Act, the Stark Law, and the Anti-Kickback Statute (to use some of the examples with which I am most familiar)
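For concreteness, here is a minimal sketch (my own illustration, not anything from Mr. Bruce's post) of how those seven roles might map onto the fields of a single citation record; every field name here is hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class CitationRecord:
    """Illustrative record capturing the roles a citation can play (a-g above)."""
    unique_name: str                                  # (a) system-wide unique name
    path: tuple                                       # (b) navigational position, e.g. title/section/subsection
    children: list = field(default_factory=list)      # (c) retrieval hook: lower-level content stored elsewhere
    threads: set = field(default_factory=set)         # (d) thread tags / associative markers
    status: str = "introduced"                        # (e) process milestone, e.g. "enacted"
    source_event: str = ""                            # (f) provenance, e.g. the enacting Public Law
    popular_names: set = field(default_factory=set)   # (g) vernacular names

# A hypothetical record for a provision used as an example later in this post:
rec = CitationRecord(
    unique_name="42 U.S.C. 1320a-7b(b)(3)(H)",
    path=("42", "1320a-7b", "b", "3", "H"),
    status="enacted",
    source_event="Pub. L. 108-173 § 431",
    popular_names={"Anti-Kickback Statute"},
)
```

The point of the sketch is simply that one provision carries several independent identifiers at once, and each deserves its own queryable field.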

Mr. Vergottini goes into the issues surrounding the selection of frameworks for actually implementing those kinds of identifiers, e.g., via a URN- or URL-based system, and discusses some of the difficulties inherent in selecting and implementing a system to capture relevant data in a machine-readable way. He also identifies problems with viewing different portions of text, as well as with tracking text that gets amended or redesignated.

Common problems Messrs. Bruce and Vergottini both discuss include documents/provisions with identical names/identifiers in an official classification system (e.g., the two subparagraphs 42 U.S.C. § 1320a-7b(b)(3)(H) that coexisted for seven years until fixed by Pub. L. 111-148 § 3301), and how to store temporally different versions of text.

I started building the ontolawgy™ platform (a web-based legal analysis system) about six years ago for my regulatory practice, and I ran into the problems discussed above quite early. Here are some of the approaches I have taken to address them:

Treat every textual division as a unique document, and allow it to be accessed via a unique URL based on its location in the government taxonomy (a - c in Mr. Bruce's overview).

Store each descriptive element about that document in a tag/field. This includes official and unofficial “popular names” (e.g., the Social Security Act), section numbers within those popular names, section numbers of the U.S. Code, Public Law enacting provisions, etc. (c - g)

Allow users to query on any of those elements. (a - g)

Track duplicates and give them distinct records that are still retrieved in an appropriate way using their descriptive tags/fields. (a - d, g)

Track each provision using its current designation, but maintain a full locative, temporal, and ontological history within the record and the system. (a - e, g) For example, 42 U.S.C. § 1320a-7b(b)(3)(I) used to be the second 42 U.S.C. 1320a-7b(b)(3)(H) that was enacted by Pub. L. 108-173 § 431 (the first subparagraph (H) was enacted by § 237 of the same Public Law); the system tracks all that information and allows users to query it and, e.g., gather together all historical versions of subparagraph (H) to track how it has changed over time.
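As a rough sketch of that last idea (a deliberate simplification of my own, not the actual ontolawgy™ data model), a provision record can carry its current designation plus an ordered history of prior ones, so that a redesignation never loses the old name:

```python
from dataclasses import dataclass, field

@dataclass
class Designation:
    citation: str     # e.g. "42 U.S.C. 1320a-7b(b)(3)(H)"
    enacted_by: str   # e.g. "Pub. L. 108-173 § 431"

@dataclass
class Provision:
    current: Designation
    history: list = field(default_factory=list)  # earlier designations, oldest first

    def redesignate(self, new: Designation) -> None:
        """Move the current designation into history and adopt the new one."""
        self.history.append(self.current)
        self.current = new

    def all_designations(self) -> list:
        """Every name this provision has ever had, oldest first."""
        return self.history + [self.current]

# The second subparagraph (H), later redesignated (I) by Pub. L. 111-148 § 3301:
p = Provision(Designation("42 U.S.C. 1320a-7b(b)(3)(H)", "Pub. L. 108-173 § 431"))
p.redesignate(Designation("42 U.S.C. 1320a-7b(b)(3)(I)", "Pub. L. 111-148 § 3301"))
```

Querying `all_designations()` is what lets a user gather together every historical version of subparagraph (H) despite the renaming.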

As for the mechanics, when I started building the system, my main goal was to get up and running quickly with a free, open-source, off-the-shelf system. The system is extremely flexible, has a very active development community, and still works quite well. While it does not currently use any sort of (proposed) standard like URN:lex or Akoma Ntoso, it does use inline markup, and thus should be easily convertible to a legal markup standard once one is in place.

I can't go into much more detail here, but please contact me to get access to my demo system if you would like to see it in action.

Part II: Obtaining legal source materials, or, how the government makes sausage even messier

All that said, one significant challenge I still face is getting rational raw data from official sources. Indentation can be highly relevant semantically, depending on the subject matter, but official sources either just do away with indentation altogether (I'm looking at you, Code of Federal Regulations) or present it in such an inconsistent format that it might as well not be there (U.S. Code).

Back to the sausage. Essentially, we pay the government to make legal sausage, cook the sausage, and serve it to us, but just before they serve it, they mash it up, smear it around the plate, then take away our silverware and tie our hands behind our backs. I spend much more time than should be necessary simply ensuring that the materials I work with are properly indented to accurately reflect their meaning. I've written several small programs to do about 95% of the work, but that remaining 5% can be almost maddening, particularly when dealing with multiple levels of unenumerated flush text. The materials are certainly drafted with visible indentation (take a look at Public Laws: all the indentation is there and correct), but all this useful information gets stripped out at some point in the publication process, and it is not at all clear to me why this happens. The U.S. Code uses “bell codes” for typesetting print documents, but this doesn't excuse the lack of indentation in electronic publications.
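To illustrate why this re-indenting work is mostly mechanical but never fully so, here is a rough sketch (my own illustration, not the actual programs I use) that guesses nesting depth from the conventional federal drafting hierarchy of enumerators; note the built-in ambiguity between letter and roman-numeral enumerators, which is exactly the kind of thing that makes the last 5% maddening:

```python
import re

# Conventional federal drafting hierarchy, deepest-binding patterns first.
# (i), (v), (x), (l) are ambiguous: they read as either roman-numeral clauses
# or letter subsections -- this sketch simply prefers the roman reading.
LEVELS = [
    (re.compile(r"^\([ivxl]+\)"), 4),   # (i)  clause
    (re.compile(r"^\([IVXL]+\)"), 5),   # (I)  subclause
    (re.compile(r"^\([a-z]\)"), 1),     # (a)  subsection
    (re.compile(r"^\(\d+\)"), 2),       # (1)  paragraph
    (re.compile(r"^\([A-Z]\)"), 3),     # (A)  subparagraph
]

def guess_depth(line: str) -> int:
    """Guess nesting depth from a leading enumerator; 0 means no enumerator
    (e.g., flush text, whose depth cannot be recovered this way at all)."""
    text = line.lstrip()
    for pattern, depth in LEVELS:
        if pattern.match(text):
            return depth
    return 0
```

Unenumerated flush text returns 0 here, which is precisely the case where no program can recover the intended indentation and a human has to intervene.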

The C.F.R. is even more maddening: This document claims that the XML format of the C.F.R. “is a complete and faithful representation of the Code of Federal Regulations, which matches most closely to the author's original intent... [and] fully describes the structure of the Code of Federal Regulations, including the large structure (chapters, parts, sections, etc.), the document structure (paragraphs, etc.), and semantic structure”, then goes on to explain that the SGML indentation for subsections, paragraphs, subparagraphs, clauses, etc. has all been collapsed to the same single tag. This means that every last bit of indentation/separation (except for line breaks) within each section—“sections” can be very long and complex, with multiple nested levels of semantically-relevant indentation—has been completely stripped from all publicly-available electronic materials. How is this supposed to help the public?

While the LII's sites offer a valuable public service, they do not solve the underlying problem: Properly indented content is not freely available to the public from the government for commercial re-use, even though these government works are in the public domain. Why is this a problem? Because official platitudes notwithstanding, government publications significantly obscure or corrupt the intended meaning and scope of the laws that govern us.

If anyone has some insight about how to get the government to bring useful and accurate indentation to its official publications, please get in touch; I would be thrilled to work with you to help make this happen.