Section identifiers (LII)

The United States Code section number
-------------------------------------

NB - 3 JUN 2008 - DAS
The assumption below that alpha extensions are lower case is producing
some bugs in some interactive access routines. Of the 50,000 plus
section numbers in the USC, more than 14,000 have alpha extensions
using lower case. However, it is important to note that an additional
312 (as of this date) have alpha extensions using upper case. Since it
is almost always obligatory to stay case-aware in parsing the USC text,
and especially external text that seeks to cite the USC, this means
that parsers should carefully look for both upper and lower, not just
go "case insensitive."
There might well be a citation out there that writes an upper-case
extension in the usual lower case, and we should honor such a reasonable
assumption, but internally we need to remember the way it really is.
The 312 will be added in a separate post.
-------------------------------------
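
One way to stay case-aware while still honoring a citation typed in the
usual lower case is to try an exact match first and fall back to a
case-folded match only when that fails. A minimal Python sketch follows.
The function names are made up for illustration, and "12A" is a
hypothetical stand-in for one of the upper-case extensions, not a
section number taken from the data.

```python
def build_lookup(section_numbers):
    """Return (exact, folded): the numbers as written, plus a case-folded map."""
    exact = set(section_numbers)
    folded = {}
    for number in section_numbers:
        folded.setdefault(number.lower(), []).append(number)
    return exact, folded


def resolve(cited, exact, folded):
    """Prefer the case-aware match; fall back to case folding only if it fails."""
    if cited in exact:
        return [cited]
    return folded.get(cited.lower(), [])


# "106a" and "106b" appear in Title 1; "12A" is a hypothetical upper-case example.
exact, folded = build_lookup(["106a", "106b", "12A"])
print(resolve("106a", exact, folded))  # exact hit
print(resolve("12a", exact, folded))   # found only through the fallback
```

The fallback returns a list because, in principle, two section numbers
could differ only in case.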

(Some observations, as of May 2008, by David Shetland as part of his
work with US Code processing for the Legal Information Institute)

A US Code "section number" is an identifier. It provides a label
which can be used to isolate one of the 50,000 sections of the Code for
special consideration. It is unique only within one of the fifty (or so)
"titles" of the Code, so it is made unique across the Code by prefixing
it with the corresponding title number. Thus, a standard citation to a
section of the US Code looks like the following...
9 USC 203
...which is to Title 9, Section 203, a real-life US Code reference.
That's simple enough, but you don't have to have lived very long to
suspect that life in that pile of 50,000, after eighty years of
development, may not always be that simple.

The two main complicating factors are smaller collections within a title (like "chapter"), and insertions.

Effect of Chapters, etc.

When the number of sections within a title becomes substantial, it
is natural and necessary to group them by subjects that are in some
sense internal to the title. With even the smallest titles, there seems
to be a convention of three chapters (see Title 1, Title 9, and now
Title 6). The chapters usually introduce a jump in the sequence of
section numbers, to correspond to what seemed a natural boundary at the
time.

-- Clearly, in Title 1, it makes sense for chapter 1 to collect
sections 1 through 8, chapter 2 to collect sections 101 through 114,
and chapter 3 to collect sections 201 through 213.

-- Clearly, in Title 9, it makes sense for chapter 1 to collect
sections 1 through 16, chapter 2 to collect sections 201 through 208,
and chapter 3 to collect sections 301 through 307.

-- Clearly, in the very new Title 6, it makes sense for chapter 1 to
collect sections 101 through 103 (plus some things called subchapters,
a different subject entirely), chapter 2 to collect section 701 and
some subchapters, and chapter 3 to collect section 901 and some
subchapters. Yes, the subchapters have something to do with the strange
chapter-level section number jumps.

So what is really clear is that when it comes to labeling things, we
have nothing to clarify but clarity itself. But you knew that by
looking in your spare closet.

Effect of Insertions (new stuff happens in the middle)

The vigorous mouse-clickers amongst you have already noticed
something very important in the middle of the Title 1, Chapter 2
section number sequence:
101, 102, 103, 104, 105, 106, oops, 106a, 106b, 107,...
I just made up the explicit "oops", of course, but it really is in
there. See the notes to learn the formal spelling of "oops":
"1951—Act Oct. 31, 1951, ch. 655, § 2(a), 65 Stat. 710, added items 106a and 106b."
Title 1 is very small and quite stable, but old enough to show some of
the insertion effect on so-called section "numbers", namely alphabetic
extensions.

How far does this go in big, old, unstable titles? Pretty far, but
there seems to be some system to it. Past performance does not
guarantee future results, but so far we've gotten away with the
following, based on careful rummaging through the 50,000.

A championship real section "number" is 12 USC 1749bbb-10c, which
indicates a third-level insertion (the "bbb", then the "-10", then the
"c").

The challenge is to make an efficient index, to answer questions like the following:
-- Does this section number exist?
-- If not, what is the "closest" that does exist?
-- What is its predecessor or successor?
-- What is its container (chapter, etc.)?
Oh, but isn't this what XML gives us almost for free? Yes, once you
have the XML. A project is underway to make the content sources be XML,
but for now, the best sources are the data that are used to typeset the
print volumes.

Analysis of the present set of all US Code section numbers indicates that the following four-part template is barely adequate:
(1) base: six decimal digit integer
(2) ext1: four character alphabetic field
(3) ext2: three decimal digit integer
(4) ext3: one character alphabetic field
With numeric fields zero-filled and alphabetic fields dash-filled on the left, this yields the following version of 1749bbb-10c:
001749-bbb010c
...which naturally collates on most systems (notice the dash in the
normalized version is a fill character, and unrelated to the hyphen in
the raw "number").
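
The template can be sketched in Python roughly as follows. This is only
an illustration under assumptions: the regular expression, the function
name `normalize`, and the `base_width` parameter are not from the LII
code, and the pattern covers only the base / ext1 / hyphen / ext2 / ext3
shape described here. Case is deliberately preserved.

```python
import re

# base digits, optional alphabetic ext1, then optionally a hyphen followed
# by a numeric ext2 and an optional alphabetic ext3 (an assumed grammar).
_SECTION = re.compile(r"^(\d+)([A-Za-z]*)(?:-(\d+)([A-Za-z]*))?$")


def normalize(raw, base_width=6):
    """Return the fixed-width, collatable form of a raw section number."""
    m = _SECTION.match(raw.strip())
    if m is None:
        raise ValueError("unrecognized section number: %r" % raw)
    base, ext1, ext2, ext3 = m.groups()
    ext2 = ext2 or ""  # None when there is no hyphen part
    ext3 = ext3 or ""
    if len(base) > base_width or len(ext1) > 4 or len(ext2) > 3 or len(ext3) > 1:
        raise ValueError("section number does not fit the template: %r" % raw)
    # Numeric fields are zero-filled, alphabetic fields dash-filled on the left.
    return (base.zfill(base_width) + ext1.rjust(4, "-")
            + ext2.zfill(3) + ext3.rjust(1, "-"))


print(normalize("1749bbb-10c"))  # 001749-bbb010c
```

Because "-" sorts before digits and letters in ASCII, left-filling the
alphabetic fields means a single-letter extension such as "z" collates
before a double-letter one such as "aa".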

So what about that hyphen in the wild section number? Our current
working principle is that it is obligatory if there is an extension-2.

Extension-1 is not obligatory for extension-2. Combined with the
required hyphen, this creates ambiguity when outside literature uses a
dash in a range citation. (Inside the US Code, the word "to" is used
consistently to indicate a range within a section number, and "through"
in USC-internal ranged cross references.) Thus, a section number
reference of "10a-10c" in the general literature needs to be
disambiguated by textual context and/or target authority lookup, if
available, between two possibilities: a single (or "first") section, or
a range from section 10a to section 10c.
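
The "target authority lookup" half of that disambiguation might look
something like the sketch below. It assumes a set of known section
numbers for the title is at hand; the function name and the sample sets
are hypothetical, and textual context is not considered at all.

```python
def interpretations(cite, known_sections):
    """List the readings of a hyphenated cite that the known numbers support."""
    readings = []
    if cite in known_sections:
        readings.append(("section", cite))
    # Try every hyphen as a possible range separator, since a range end
    # may itself contain a hyphen (as in "1749bbb-10a").
    for i, ch in enumerate(cite):
        if ch == "-":
            left, right = cite[:i], cite[i + 1:]
            if left in known_sections and right in known_sections:
                readings.append(("range", left, right))
    return readings


# Hypothetical authority data, for illustration only.
print(interpretations("10a-10c", {"10a", "10b", "10c"}))
print(interpretations("10a-10c", {"10a", "10c", "10a-10c"}))
```

When both readings come back, the lookup alone cannot decide and the
surrounding text has to.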

The classic use case for needing a fast index of authoritative
section numbers is the one that forces me to pull out the Ugliest of
All section numbers - the one I call the "ranged" section number.

To use an extreme but real-life "section" from current data, we want
to be able to cite "12 USC 1749bbb-10c", which may have crept into
someone's notes, although the relevant section heading is actually
"§§ 1749bbb-10a to 1749bbb-10d. Omitted"
indicating that the current state of affairs is that our target section
is in a range of sections that has been omitted. Ah, if only we had our
system in place to ring up the ancient section (which for "omitted"
might require more than old USC)! That would be wonderful, but even if
we did, chances are very good that what we really need is the current
note about the omission, which is here, not in some old document.

So, "1749bbb-10a to 1749bbb-10d" is our raw "section number" and it
gets parsed into the two ends of a range, with normalized forms of
"00001749-bbb010a" and "00001749-bbb010d" (shown here with an
eight-digit base, as in the external ID below, rather than the six
digits of the template above). Since these collate very well and very
quickly, it is easy not only to find out about the two extremes, but
also to confirm that our cite is within the range. Bear in mind that
the current data set has very useful information about section
1749bbb-10c (in the "section" with the above ranged "section number"),
but nothing tagged or predictably structured at all.
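
A sorted list of normalized keys plus a binary search is enough to
answer the "does it exist, and what are its neighbors" questions,
ranged sections included. The sketch below assumes the keys are already
normalized to the eight-digit form. Only the ranged entry comes from the
text above; the two neighboring entries and the name `locate` are
placeholders for illustration.

```python
import bisect

# Each entry is (low_key, high_key, raw_heading); a single section has
# low_key == high_key. The first and last entries are illustrative only.
ENTRIES = sorted([
    ("00001749-bbb009-", "00001749-bbb009-", "1749bbb-9"),
    ("00001749-bbb010a", "00001749-bbb010d", "1749bbb-10a to 1749bbb-10d"),
    ("00001749-bbb011-", "00001749-bbb011-", "1749bbb-11"),
])


def locate(key, entries=ENTRIES):
    """Return (containing entry or None, predecessor, successor) for a key."""
    lows = [low for low, _high, _raw in entries]
    i = bisect.bisect_right(lows, key) - 1  # last entry starting at or before key
    inside = i >= 0 and key <= entries[i][1]
    hit = entries[i] if inside else None
    before = i - 1 if inside else i
    pred = entries[before] if before >= 0 else None
    succ = entries[i + 1] if i + 1 < len(entries) else None
    return hit, pred, succ


hit, pred, succ = locate("00001749-bbb010c")
print(hit[2])  # 1749bbb-10a to 1749bbb-10d
```

When there is no containing entry, the predecessor and successor are
the candidates for the "closest" section that does exist.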

Note:
To produce one of our "external IDs" like "usc_sec_01_00000101----000-"
(the basis for the URI for title 1, section 101) you need a little
more, but not much.
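
The note does not spell out what the "little more" is, so the sketch
below only takes the one example apart. The field layout in the pattern
(two-digit title, eight-digit base, then the ext1, ext2 and ext3 widths
from the template) is inferred from that single ID and is an assumption,
as is the function name.

```python
import re

# Layout inferred from the single example above (an assumption, not a spec):
# "usc_sec_" + 2-digit title + "_" + 8-digit base + 4-char ext1
# + 3-digit ext2 + 1-char ext3, with "-" as the alphabetic fill character.
_EXTERNAL_ID = re.compile(
    r"^usc_sec_(\d{2})_(\d{8})([-A-Za-z]{4})(\d{3})([-A-Za-z])$"
)


def split_external_id(external_id):
    """Split an external ID into its fields, dropping the fill characters."""
    m = _EXTERNAL_ID.match(external_id)
    if m is None:
        raise ValueError("unrecognized external ID: %r" % external_id)
    title, base, ext1, ext2, ext3 = m.groups()
    return {
        "title": int(title),
        "base": int(base),
        "ext1": ext1.lstrip("-"),
        "ext2": int(ext2),
        "ext3": ext3.strip("-"),
    }


print(split_external_id("usc_sec_01_00000101----000-"))
```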
--end-for-now--