Libraries for everyone, by everyone, shared with everyone, about everything

Concepts in catalogs: Where the data comes from

I’ve now made a few posts about concept-oriented catalogs, describing the basic idea, showing some examples, and talking about the kinds of context they should provide for users. As I mentioned in my first post, concepts in such catalogs are “first-class locuses of information to help readers find useful knowledge resources”. The catalogs I’m describing include a variety of concepts (beyond the bibliographic record) that have data associated with them, and this data gives users a helpful context for finding appropriate knowledge resources.

As I said in my example post, “The concepts come from, and are maintained by, various groups of people…. [They] may be derived in part from existing MARC bibliographic metadata (sometimes through automated analysis), but often draw from additional data sources.”

If you’ve worked in cataloging lately, you might be thinking, “that’s nice, but we’ve got our hands full just providing MARC catalog records for all the books and other stuff coming through the door now. Where’s all this other ‘concept’ data going to come from? And how will it be practical to use and maintain?”

In this post, I’d like to take a stab at answering those questions. I’ll draw a lot from my experience with subject maps, but a lot of what I say should apply to other kinds of concept data as well.

The conceptual data behind subject maps consists of annotations on different subjects and links between related subjects. A lot of what I need to build these maps can simply be reused from existing data. In particular, the Library of Congress Subject Headings system (LCSH) provides a large set of subjects with standardized names. We also have a set of authority records associated with those subjects that give alternate names, notes, and links to related subjects.

To make it practical to build a subject map for this data, I bulk-loaded authority records from our local catalog. While the Library of Congress Authorities are more up to date than our local catalog, I could only look up records there one at a time, through an interface designed for manual browsing. Fortunately, since then the Library of Congress has provided ways to download subject authority data in bulk. It’s in a format that omits some details, but it should still help fill out our maps when we start including these records as well. Because our library and the Library of Congress are both using a common system of identifiers for subjects, as well as compatible formats for expressing subject relationships, I’ll be able to combine our authority information with theirs to provide useful maps. The identifiers we use are not always in sync; LCSH subject terms do get renamed and discontinued from time to time. But the cross-references in LCSH authority records, which often include the old terms as aliases of the new terms, help reduce the pain involved in moving from old terms to newer terms.
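The merge described above can be sketched roughly as follows. This is an illustrative sketch only, assuming a simplified record shape keyed by subject identifier; the field names, identifiers, and alias table are invented, not an actual authority format.

```python
# Hypothetical sketch: combining local authority records with records
# bulk-downloaded from the Library of Congress, keyed by a shared
# subject identifier. Record fields here are invented for illustration.

def merge_authorities(local, loc_bulk):
    """Merge two dicts of {subject_id: record}, letting the more
    up-to-date Library of Congress fields win where both sources
    have a record for the same identifier."""
    merged = dict(local)
    for subject_id, record in loc_bulk.items():
        merged[subject_id] = {**merged.get(subject_id, {}), **record}
    return merged

# Cross-references in authority records often list discontinued terms
# as aliases of their replacements, easing migration to newer terms:
aliases = {"Afro-Americans": "African Americans"}  # invented example pair

def resolve(term):
    """Map a discontinued term to its current form, if known."""
    return aliases.get(term, term)
```

The key design point is that a shared identifier system makes the merge a simple keyed update rather than a fuzzy-matching problem.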

Subject maps built just on authority records turn out to be pretty generic, and not as useful as they could be. To make them more useful, we need more data. As I describe in more detail in this white paper, I also analyze our bibliographic corpus to see what subject terms we actually use in our catalog, look at the structure of those terms (which are often coordinated from multiple components), and also look at correlations between terms that get used together in the same bibliographic records. This analysis lets me create additional useful relationships between subjects. In short, I use automated analysis of a large data corpus to create new concept data from existing data.
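The two analyses mentioned above, decomposing coordinated headings and correlating terms that co-occur in the same record, can be sketched briefly. This is not the author's actual analysis code; the sample records and the "--" subdivision convention are the only things drawn from LCSH practice.

```python
from collections import Counter
from itertools import combinations

# Illustrative sketch. LCSH coordinates headings from components
# joined by "--" subdivisions, e.g. "Philadelphia (Pa.)--History".
records = [
    ["Philadelphia (Pa.)--History", "United States--Politics and government"],
    ["Philadelphia (Pa.)--History", "Pennsylvania--History"],
]

def components(heading):
    """Split a coordinated heading into its component parts."""
    return heading.split("--")

# Count which headings appear together in the same bibliographic
# record; frequent pairs suggest a useful relationship between subjects.
cooccurrence = Counter()
for subjects in records:
    for a, b in combinations(sorted(set(subjects)), 2):
        cooccurrence[(a, b)] += 1
```

Run over a full bibliographic corpus, counts like these surface relationships that no single authority record states.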

In order to link together the many subjects that have geographic aspects, I need some extra data that isn’t in authority records. Once I created a data record that noted that “Pennsylvania” is a US state that gets abbreviated “Pa.” in some subject headings, I was able to build all kinds of relationships between “Philadelphia (Pa.)” and related subjects, none of which are directly stated in the authority records for these subjects, but all of which can be derived by automated analysis. (It helps that subject terms in LCSH have a fairly well-defined structure that’s amenable to lexical analysis.) A couple hundred other brief geographic data records are enough to let users zoom in and out of locations all over the globe. So a small amount of well-designed and curated supplementary data can often enhance lots of concepts, with minimal maintenance cost.
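A minimal sketch of the lexical analysis this paragraph describes, assuming a hand-curated table of geographic records in an invented format:

```python
import re

# Hypothetical supplementary data: a small curated table mapping the
# abbreviations used in subject headings to fuller geographic records.
places = {
    "Pa.": {"name": "Pennsylvania", "kind": "US state", "within": "United States"},
}

def broader_place(heading):
    """LCSH place headings like "Philadelphia (Pa.)" put the containing
    jurisdiction in a parenthesized qualifier, so a simple lexical
    pattern can recover the broader place for zooming out."""
    m = re.match(r".+\((.+)\)$", heading)
    if m and m.group(1) in places:
        return places[m.group(1)]["name"]
    return None
```

A couple hundred entries in a table like `places` is enough to relate a very large number of geographic headings, which is why the maintenance cost stays small.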

While I can easily zoom in and out between the US, Pennsylvania, Philadelphia, and locations within Philadelphia, I’d need more data to move side to side. I don’t have any data, for instance, that tells me that Philadelphia is right next to Camden, New Jersey. But fortunately, I can mine external data sources to find this information. I recently read about a source of public domain global map data, for instance, that I (or any other geographic-concept catalog builder) could use to link subjects or other resources to a world map.

Increasing amounts of public data are distributed online. If the data is public domain, or available with a liberal license, I don’t have to worry about legal roadblocks to downloading it, analyzing it, and using it in my own work. Sharing data helps everyone build not only smarter catalogs, but smarter applications of all kinds.

Data sharing does not always happen painlessly. I may have different concepts, or different names for concepts, than someone else whose data I might find useful. We may have different ideas about how to structure our data. But there are now systems that provide links between different names, and crosswalks between different structures that can help bridge the gap between my data and that of others.
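The name links and structural crosswalks mentioned above amount, in the simplest case, to something like the following sketch. Every identifier and field name here is invented for illustration.

```python
# Hypothetical sketch: bridging someone else's data to my own.

# Different names for the same concept:
same_as = {"Penna.": "Pennsylvania", "PA": "Pennsylvania"}

def from_other_format(record):
    """Crosswalk a record shaped like {'place': ..., 'inside': ...}
    (someone else's structure) into our own
    {'name': ..., 'within': ...} shape, normalizing names as we go."""
    return {
        "name": same_as.get(record["place"], record["place"]),
        "within": record["inside"],
    }
```

Real crosswalks (say, between MARC and other metadata formats) are far more involved, but the principle is the same: a maintained mapping lets two independently structured data sets be combined.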

With large enough corpuses of data to draw on, I can even make use of unstructured information from large groups of ordinary users. For example, LibraryThing’s tag cloud displays a number of terms that are useful to include in one’s own library catalog. Not all of them are formally defined subjects, but they’re used enough that we should expect most of them to be used in patron searches. It should be possible to analyze the cloud and the things tagged in the cloud to associate many informal terms with particular subjects or library resources.
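One plausible way to do the analysis suggested above is to count how often each informal tag and each formal subject heading appear on the same item; frequent pairs suggest a mapping. The data below is invented for illustration.

```python
from collections import Counter

# Hypothetical data: informal user tags on books, alongside the formal
# subject headings those same books carry in a catalog.
tagged_books = [
    {"tags": {"wwii"}, "subjects": {"World War, 1939-1945"}},
    {"tags": {"wwii", "history"}, "subjects": {"World War, 1939-1945"}},
]

# Count (tag, subject) co-occurrences across the corpus; the strongest
# pairs associate an informal term with a formal subject.
pair_counts = Counter()
for book in tagged_books:
    for tag in book["tags"]:
        for subj in book["subjects"]:
            pair_counts[(tag, subj)] += 1
```

With a corpus the size of LibraryThing's, even a crude count like this should link many informal search terms to formal subjects.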

To summarize, it becomes much easier to derive the data needed for concept-oriented catalogs if:

- We have stable (or at least smoothly evolving) identifiers for concepts
- We can use, swipe, and reuse a large domain of [meta]data for concept analysis (including automated analysis)
- We carefully consider what additional concept data would enhance our services, and use standard, recognized forms to represent it
- We have correspondences and crosswalks between different concept identifiers and formats
- We share our concept data (and bibliographic data in general) as openly and broadly as possible
- And we share information, expertise, and code that supports the innovative, useful catalogs we build.

There’s a non-trivial technical infrastructure implied by these requirements. But it’s one that we can build. (Quite a bit of it is in place already.) A lot of it depends on a healthy social infrastructure to create, maintain, share, and work with all the data and services that we create and adopt. I hope to talk more about this social infrastructure in future posts.