Ouch. I’ve just been looking over the last blog post on interoperability, which has all the charm of an underfed seagull on crystal meth. Squawk, squawk, squawk. Amid all the screeching in the last post, it’s a little hard to figure out what the point was. So I’ll just say it: folks, the future does not lie in putting up huge, centralized collections of caselaw. It lies in building services that can work across many individual collections put up by lots of different people in lots of different institutional settings. Let me say that again: the future does not lie in putting up huge, centralized collections of caselaw. It lies in building services that can work across many individual collections put up by lots of different people in lots of different institutional settings. Services like site-spanning searches, comprehensive current-awareness services, and a scad of interesting mashups in which we put caselaw, statutes and regulations alongside other stuff to make new stuff.

There are some services like that. AltLaw is one; so is the Public Library of Law; so are the Legal Research Engines that the Cornell Law Library runs; and I’m sure I’m omitting many more, including some we built here at the LII more than five years ago. Most are either “framers” (who put a wrapper around multiple sites operated by others) or “spiders” (who, like Google, crawl other sites and federate the content in interesting ways). These are fairly blunt instruments (they don’t show much in the way of law-specific metadata), and the spiders in particular are hard to maintain. And there are really very few of them. There has not been very much building of distributed services in the legal-information world.

Why is that? It’s partly because trying to build and maintain site-spanning applications in the absence of standards is insane. Source material moves around. Sites disappear and reappear. Firewalls suddenly block your robot, then just as surprisingly stop, after you’ve spent two weeks finding an e-mail address for a webmaster who is wisely concealing her identity. Robots.txt files suddenly sport new policies. The subdirectory on site X that holds the decisions for the month of December changes its name from dec2007 to december2007. The name of the judge writing the opinion gets moved from the third line after the second <H3> in the document to the fourth line. And so on. And on. And on. The whole thing is a house of cards, because there are few common practices among sites and very little consistent practice on any site. This makes it very difficult to automate things, and things that can’t be automated won’t scale. In such an unstable environment, building services that remain reliable over the long term is very difficult. And (I speak from experience here) it’s mind-numbingly annoying, because the things that (frequently) break such services are trivial and preventable and numberless and arbitrary. For a programmer, it’s like being nibbled to death by ducks.

These are not new problems. Librarians and others concerned with long-term information availability have been discussing these issues since about fourteen minutes after the first Web site appeared. Reporters of decisions and court clerks have long settled similar issues in print publication, and are beginning to do so on the Web. But much more is needed, and faster. The Web is not going away, and Web publication of legal information should not be thought of as a kind of unfunded mandate delivered as a sop to those who don’t buy the books.

Sorry, the seagull started screaming again. Persistent little bugger.

Difficulty and tedium aside, few Internet legal-information providers have been interested in building distributed services. How come? It’s partly because we’re brainwashed by centralized models that are the legacy of many years’ reliance on Westlaw and Lexis. It’s partly because law people deliberately confuse that kind of branded centralization with authority — easy to do when those who grant “official status” use it as a form of barter, chiefly with those who operate large, centralized systems. It’s partly because, up until now, pulling everything together in one big heap has been easier than creating interoperability. And it’s mostly because we haven’t been paying attention.

More than a decade ago, the digital-library community began solving this problem, systematically and effectively. They were mostly dealing with another kind of heavily cross-referenced essay: not the judicial opinion, but the scientific pre-print. Many approaches were tried; some (like Dienst, which Brian Hughes and I built into a law-journal repository system a decade ago) were glorious failures. But ultimately these folks were successful because they realized several things:

You can unbundle services from repositories. This is what Google does. It doesn’t hold everything — it just indexes it. The same thinking applies to things like current-awareness services that need input from multiple sources. You can do that without holding everything yourself. Indeed, services like large-scale search will only work if you unbundle them from repositories. Early on, there were many attempts at federating search services that failed because the whole system was held to the performance of the weakest participant. As a practical matter, scaling past 100 sites just would not work no matter what.

Services can be made a lot better if they have metadata available to them, particularly metadata about where to find the documents the service addresses. This is the basis of Google Sitemaps and the other related site-mapping standards. As an idea, it goes back to at least 1992 and Archie, the system for discovering anonymous-FTP sites. An important side effect is that participation in such schemes gives the repository operator greater exposure for her information; in a way it’s a form of marketing.

Issuing metadata in a standard format makes a lot of things easier — like developing harvesting tools, services, and anything that has to process the metadata. XML is a really good vehicle for this, because it can be validated and reliably processed. This makes new services much, much cheaper to build. And, if your metadata standard can be extended by well-understood technical means, so that communities can effectively customize it — well, you get a lot of leverage in the form of standardized toolsets and the like.

Most important, all of this can take place independently of administrative structures, institutional gaps, or any other incidental barrier. It doesn’t matter who the repository operators are, or where they are, or what sort of institution they’re affiliated with. No consortium or other administrative apparatus is needed. It is up to the service provider to decide what makes a useful aggregation. And that is a very scalable idea.
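To make the point about location metadata a little more concrete, here is a minimal Python sketch of what a service can do once a repository publishes a sitemap-style XML file. It is an illustration, not anybody's production harvester: the court site, its URLs, and the function name documents_changed_since are all invented, and the sketch assumes each lastmod value begins with a YYYY-MM-DD date. The only real thing in it is the sitemap namespace and element names.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap that a court site might publish; every URL is invented.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://courts.example.org/opinions/05-9876.html</loc>
    <lastmod>2007-11-30</lastmod>
  </url>
  <url>
    <loc>https://courts.example.org/opinions/06-1234.html</loc>
    <lastmod>2007-12-14</lastmod>
  </url>
  <url>
    <loc>https://courts.example.org/opinions/06-5678.html</loc>
    <lastmod>2007-12-20T09:30:00+00:00</lastmod>
  </url>
</urlset>"""


def documents_changed_since(sitemap_xml, cutoff):
    """Return (url, lastmod) pairs whose lastmod date is on or after cutoff.

    Assumes cutoff is YYYY-MM-DD and each lastmod starts with a date in the
    same form, so plain string comparison orders them correctly.
    """
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.iterfind("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        # Entries without a lastmod are skipped: "" never sorts after a date.
        if loc and lastmod[:10] >= cutoff:
            changed.append((loc, lastmod))
    return changed


for loc, lastmod in documents_changed_since(SAMPLE_SITEMAP, "2007-12-10"):
    print(lastmod, loc)
```

Notice what is missing: no guessing at directory names, no counting lines after the second <H3>. The repository says where its documents are and when they changed, and the service simply reads the answer.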

Let’s hope it’s salable as well as scalable, because of course it depends on network effects for most of its value. It’s working that way in the digital-libraries world, where OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting) is now a basic standard used by hundreds if not thousands of sites. Creators of legal data can do likewise. The protocol is easy to implement, and at the LII we are making it even easier by building OAI implementations that can easily be bolted onto existing case-management systems and otherwise fed from existing repositories. If you’re interested, take a look at http://oai4courts.wikispaces.com, where we’re starting to put things together (including a reference implementation you can tour, and for which we will shortly release code that all can use).
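For the curious, here is roughly what harvesting over OAI-PMH looks like from the service side, as a small Python sketch. The verb (ListRecords), the oai_dc metadata prefix, the resumptionToken mechanism, and the XML namespaces come from the protocol itself; the repository address, the sample record, and the function names are assumptions made up for illustration. It is not the LII reference implementation, and it parses a canned response rather than calling a live server.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces used by OAI-PMH 2.0 responses carrying unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, from_date=None, resumption_token=None):
    """Build a ListRecords request URL for a repository's OAI-PMH endpoint."""
    if resumption_token:
        # A resumptionToken is an exclusive argument: it is sent with the verb only.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        if from_date:
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, next_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind("oai:ListRecords/oai:record", NS):
        records.append({
            "id": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
            "url": rec.findtext(".//dc:identifier", namespaces=NS),
        })
    token = root.findtext(
        "oai:ListRecords/oai:resumptionToken", default="", namespaces=NS
    )
    return records, (token.strip() or None)


# Hypothetical response from an invented court repository.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2008-01-15T12:00:00Z</responseDate>
  <request verb="ListRecords">https://courts.example.org/oai</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:courts.example.org:06-1234</identifier>
        <datestamp>2007-12-14</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Doe v. Roe</dc:title>
          <dc:date>2007-12-14</dc:date>
          <dc:identifier>https://courts.example.org/opinions/06-1234.html</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://courts.example.org/oai", from_date="2007-12-01"))
records, next_token = parse_list_records(SAMPLE_RESPONSE)
print(records, next_token)
```

A real harvester would fetch each URL, keep requesting with the returned token until none comes back, and handle protocol errors. The point is only that the same few dozen lines would work against any repository that speaks the protocol, which is exactly what makes the services cheap to build.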