‘The State Decoded’ Squeezes Rich Metadata Out of Boring Legal Codes

At first blush, state legal codes seem pretty simple. You’ve got titles, which are composed of chapters, which in turn comprise sections — or something very much like that. It’s a straightforward hierarchy, and you might not think that there’s a lot of interesting metadata to be extracted from them. But it turns out that a rich mesh of metadata lies just beneath the surface, and by mining that metadata, The State Decoded, a 2011 Knight News Challenge project, is creating an innovative method of navigating state codes.

Here are a few of the most interesting sources of metadata that the project is extracting so far:

Intracode Cross-References

Most sections of state codes contain references to other sections of the code. For instance, § 24.2-107 of the Virginia Code says:

Notice of each meeting shall be given to all board members either by the secretary or the member calling the meeting at least three business days prior to the meeting except in the case of an emergency as defined in § 2.2-3701. Notice shall be given to the public as required by § 2.2-3707. All meetings shall be conducted in accordance with the requirements of the Virginia Freedom of Information Act (§ 2.2-3700 et seq.) unless otherwise provided by this section.

In three sentences, there are three references to other sections of the code, all in title 2.2, chapter 37, and all far removed from title 24.2, where the references are found. From that, we can infer a pretty strong connection between title 2.2, chapter 37 and our current section. That makes logical sense: title 2.2, chapter 37 is the Virginia Freedom of Information Act, the current section is about the administration of elections, and the section requires that these meetings follow the act’s rules.
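As a rough illustration of how references like these might be pulled out of a section's text, here is a minimal sketch. The regular expression assumes Virginia-style citations of the form "§ title-section" with purely numeric parts, and the function names are hypothetical, not taken from The State Decoded itself.

```python
import re
from collections import Counter

# Assumed citation shape: "§ 2.2-3701", where the part before the hyphen is
# the title and the part after it is the section number. Titles containing
# letters would need a broader pattern.
CITATION = re.compile(r"§\s*(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)")

def extract_references(text):
    """Return (title, section_id) pairs for each citation found in text."""
    return [(m.group(1), f"{m.group(1)}-{m.group(2)}")
            for m in CITATION.finditer(text)]

def titles_cited(text):
    """Count how often each title is cited, a rough signal of relatedness."""
    return Counter(title for title, _ in extract_references(text))

# Fragments of the passage quoted above.
sample = (
    "except in the case of an emergency as defined in § 2.2-3701. "
    "Notice shall be given to the public as required by § 2.2-3707. "
    "the Virginia Freedom of Information Act (§ 2.2-3700 et seq.)"
)
print(extract_references(sample))
print(titles_cited(sample))
```

Counting citations per title is only one possible measure; a section that cites a title three times is not necessarily three times as related to it.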

Word and Phrase Definitions

State codes are forever defining things. Many titles and chapters begin by laying out the legal definitions for dozens of key terms that are addressed within, and codes frequently start out with a long list of definitions for the whole code. Cataloging these terms and their definitions and mapping their usage throughout the code provide some powerful clues about which parts of the code are related to one another.
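One way to map where defined terms are used is sketched below. It assumes the terms have already been cataloged and that each section's text is available by ID; the section IDs and sample text are invented for illustration, and the function name is hypothetical.

```python
import re
from collections import defaultdict

def sections_using_terms(terms, sections):
    """Map each defined term to the sorted IDs of sections whose text uses it."""
    usage = defaultdict(list)
    for term in terms:
        # Whole-phrase, case-insensitive match. A real system would also need
        # to handle plurals and the scope within which a definition applies.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for section_id, text in sorted(sections.items()):
            if pattern.search(text):
                usage[term].append(section_id)
    return dict(usage)

# Invented sample data, not real statutory text.
terms = ["public body", "emergency"]
sections = {
    "A-1": "Each public body shall give notice of its meetings.",
    "B-7": "In an emergency, the public body may meet without notice.",
    "C-3": "The fee for a license is set by the board.",
}
usage = sections_using_terms(terms, sections)
print(usage)
```

Sections that appear together under the same term (here, the two that mention "public body") become candidates for being related.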

Legislative Origins

The bills that establish or amend laws (or even just attempt to do so) frequently affect multiple titles and chapters of the code. When a single bill touches several sections, it can be inferred that those sections are related. The summaries of these bills can also serve as a source of keywords for the sections that they cite, allowing sections to be tagged automatically to aid in searching.
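The inference from bills to related sections can be sketched as a count of how many bills touch each pair of sections. The bill names and section IDs below are invented, and the function name is hypothetical; this is one plausible approach, not a description of the project's actual code.

```python
from collections import Counter
from itertools import combinations

def co_amendment_counts(bills):
    """Count how many bills touch each unordered pair of sections."""
    pairs = Counter()
    for touched in bills.values():
        # Deduplicate and sort so each pair has a single canonical ordering.
        for a, b in combinations(sorted(set(touched)), 2):
            pairs[(a, b)] += 1
    return pairs

# Invented sample data: each bill maps to the sections it would amend.
bills = {
    "Bill 1": ["A-1", "B-7"],
    "Bill 2": ["A-1", "B-7", "C-3"],
    "Bill 3": ["C-3"],
}
counts = co_amendment_counts(bills)
print(counts.most_common())
```

Pairs that several bills touch together (here, "A-1" and "B-7") are the strongest candidates for being related.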

Visitor Behavior

The ways that people use The State Decoded also help map the relationships between sections of code. People don’t tend to casually browse legal data like this; instead, they seek out specific information. Anonymized usage data therefore reveals the patterns of goal-oriented research: when people tend to visit the same sections in a single session, it can be inferred that those sections are related.
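To make the session idea concrete, here is a minimal sketch that turns anonymized sessions into a "related sections" list for any given section. It assumes each session is simply a list of the section IDs visited; the sample data and function names are hypothetical.

```python
from collections import Counter, defaultdict

def related_by_session(sessions):
    """For each section, count the other sections seen in the same session."""
    related = defaultdict(Counter)
    for visited in sessions:
        unique = set(visited)  # repeat visits within a session count once
        for section in unique:
            for other in unique:
                if other != section:
                    related[section][other] += 1
    return related

def top_related(related, section, n=3):
    """Return up to n (section_id, count) pairs, most co-visited first."""
    counts = related.get(section, Counter())
    # Break ties by section ID so the ordering is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Invented sample sessions, with no visitor identifiers attached.
sessions = [
    ["A-1", "B-7"],
    ["A-1", "B-7", "C-3"],
    ["C-3", "A-1", "A-1"],
]
related = related_by_session(sessions)
print(top_related(related, "C-3"))
```

Raw counts favor heavily visited sections, so a production version would probably normalize by how often each section is visited overall.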

Other sources of metadata are likewise useful: court decisions, scholarly articles, and attorney general opinions, for instance. These references, when connected, constitute a powerful method of understanding the semantic relationships between different parts of a state code. This data can be shared with visitors to help guide them towards the information that they need, quickly and efficiently.
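One simple way to connect these sources is a weighted sum of per-pair scores, sketched below. The signal names, weights, and scores are all invented for illustration; how much each source should count is an open design question, not something the project has specified.

```python
from collections import defaultdict

def combine_signals(signals, weights):
    """Sum weighted pair scores from several metadata sources."""
    combined = defaultdict(float)
    for name, pair_scores in signals.items():
        weight = weights.get(name, 1.0)  # unlisted sources count at face value
        for (a, b), score in pair_scores.items():
            # Sort the pair so (a, b) and (b, a) accumulate together.
            key = tuple(sorted((a, b)))
            combined[key] += weight * score
    return dict(combined)

# Hypothetical signals: each maps a pair of section IDs to a raw score.
signals = {
    "cross_references": {("A-1", "B-7"): 3},
    "co_visits": {("B-7", "A-1"): 2, ("A-1", "C-3"): 1},
}
weights = {"cross_references": 2.0, "co_visits": 0.5}
scores = combine_signals(signals, weights)
print(scores)
```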

Even the most boring data structures can have some fascinating metadata structures underlying them. You’ve just got to know where to look.