April 29, 2013

Libraries and bookstores have perennially faced the problem of how to organize books on their shelves. There’s a tension between making certain books easy to find for readers with one set of interests, and making them more difficult to find for other readers. For instance, some libraries and bookstores near me have a section for African American fiction. Readers particularly interested in African American authors can easily find their books in this section. But if novels by African American authors are shelved there instead of in the general fiction section, readers browsing general fiction might not find many African American authors there. Similar issues have arisen with genre fiction sections in libraries. A separate “Science Fiction” section can be a convenient service for fans of that genre. But some readers have objected that such sections push science fiction off into a corner, making it easy for “mainstream” readers to overlook the genre.

In theory, online libraries shouldn’t have as much problem organizing their books and subjects. Freed from the physical constraints of bound paper and shelves, the same book can be placed in many virtual locations, not just one. But in practice, many of the problems of categorization persist in the online world. Last week, for instance, Amanda Filipacchi noted in the New York Times that Wikipedia’s category listing of American novelists was disproportionally male, in part because some editors had been taking women authors out of this category and moving them to the more specialized “American women novelists” category. As far as I can tell, Wikipedia policy does not call for this sort of marginalization, but it doesn’t prevent it from happening either. It’s not just a matter of editors with an agenda and time on their hands; it also happens because manually filing people under multiple categories takes more effort than filing them under one, and it’s easy to neglect or forget to put someone in a broader category after placing them in a narrower one. So people are classified under women authors but not authors, under chemists but not scientists, under Catholics but not Christians. Readers who look for articles in the more general category listings can easily miss people who are only filed in the more specific ones. (And even if those category listings were not originally intended for browsing, many Wikipedia readers do use them that way.)

In systems that have explicit hierarchies of categories (such as Wikipedia categories, or Library of Congress Subject Headings), there’s a fairly straightforward way to solve this particular problem: When a person is placed in a specific category, the system should automatically also place them in any broader categories of people that encompass the original category. If someone is categorized under “Women chemists”, for instance, they should also get automatically categorized under “Chemists”, “Women scientists”, and “Scientists”. This inclusion can be implemented in various ways, but the important thing is that narrowly-classified people should be just as visible for readers browsing the broader categories as people that were explicitly classified under the broader categories.
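As a rough illustration of this kind of automatic inclusion, here is a minimal Python sketch. It is not the code running on The Online Books Page or in Wikipedia; the `BROADER` table, the function names, and the sample people are all hypothetical, and a real catalog would load its hierarchy from its own data.

```python
from collections import defaultdict

# Hypothetical hierarchy: each category maps to its immediately broader categories.
BROADER = {
    "Women chemists": {"Chemists", "Women scientists"},
    "Chemists": {"Scientists"},
    "Women scientists": {"Scientists"},
    "Scientists": set(),
}

def all_broader(category, broader=BROADER):
    """Return every category that directly or indirectly encompasses `category`."""
    seen = set()
    stack = [category]
    while stack:
        current = stack.pop()
        for parent in broader.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def members_by_category(explicit, broader=BROADER):
    """Map each category to everyone filed in it explicitly or through a narrower category."""
    listing = defaultdict(set)
    for person, categories in explicit.items():
        for category in categories:
            listing[category].add(person)
            for ancestor in all_broader(category, broader):
                listing[ancestor].add(person)
    return listing

# Only the explicit, most specific assignments are stored (sample data for illustration).
explicit = {"Marie Curie": {"Women chemists"}, "Linus Pauling": {"Chemists"}}
listing = members_by_category(explicit)
```

In this sketch only the explicit assignments are stored, and the broader listings are derived from them, so editing the hierarchy and recomputing would bring the derived memberships up to date.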

We could be doing this sort of thing in other library catalogs, and in Wikipedia, as well. Why aren’t we? I’ve seen a few objections to the idea:

It’s too hard to implement? It doesn’t have to be. It took me just part of a Sunday afternoon to implement the feature on The Online Books Page, and I suspect a good programmer who was familiar with (and could modify) the relevant source code would not have much trouble implementing the feature in a well-designed catalog or Wiki. In my experience, I had to spend more time modifying my data than modifying my code. The Library of Congress Subject Headings, the subject system used by The Online Books Page, is not complete or consistent in its subject hierarchies, and I had also miscoded some topical subjects as people. But it’s possible to clean up and enhance this kind of data, and doing so often benefits both present and future applications of the data.

It defeats the purpose of hierarchical categories? I’ve seen this objection made in some of the Wikipedia discussions around this issue, and it doesn’t make sense to me when I think it through. Far from being useless, the category hierarchy is precisely what makes it possible to automatically promote people in narrow categories into broader categories. It also helps save the time of categorizers; they only have to explicitly place people in precise categories, and if the hierarchy is well-constructed the system will automatically take care of the broader categories. (If the system also keeps track of which category assignments are explicit and which are automatic, it can also update them appropriately when categorizations or hierarchies get edited.) I’m also not flattening hierarchies across the board; I’m only recommending at this point that this sort of promotion be done for people, in categories of people. (More generally, it might be useful for any kind of individual instance that is categorized under abstract classes of those instances. But doing it for people is a good start.)

It makes the broader categories too crowded to be useful? In a comprehensive catalog such as Wikipedia, there will be a lot of people in categories like “writers”, once you include all the people in sub-categories. But there still will be a lot of people in that category even if you banish all the women to a “women writers” subcategory. Creating another category for “men writers” doesn’t really solve the problem; all it does is force people to choose which gender they want to browse, instead of letting them browse writers of both genders if that’s what they want to do. And after the split, the broader “writers” category will most likely still be left with a random assortment of writers without gender classification, who might or might not be the people a reader is most interested in.

Well-designed interfaces make it possible to usefully browse large collections of items. Relevance ranking, for instance, can be used to put the most notable examples of a category at the top of a long list of its members. That’s in fact what we routinely expect to happen in good search engines. And mechanisms like faceted navigation (used in many online catalogs) and subject maps (used on The Online Books Page) make it easy to shift focus to more precise or related categories based on a reader’s interests. In systems that implement these features, categories with lots of members are good things to have, not bad things.

I haven’t yet implemented relevance ranking in my subject browsing. Right now, The Online Books Page doesn’t actually classify many people to begin with, so most of my categories don’t have a lot of people in them. But I could see a number of ways to implement such ranking in a catalog like The Online Books Page, or in Wikipedia, which I can discuss later if there’s interest.

In summary, then, well-designed catalogs and wikis should be able to categorize people comprehensively without marginalizing them. Three features that make this possible are:

detailed, well-organized systems of categories and their relationships

systems that automatically show people in broader categories when they’re classified in narrower ones

and ranking and navigation mechanisms that make it easy to pick out the people with the most general interest, or the qualities of interest to a particular researcher, from a large overall set of people.

I’ll continue to work on implementing these features on The Online Books Page, and would be very interested in participating in discussions of how they can better work there, in other catalogs, and in systems like Wikipedia.

March 4, 2013

I’ve heard the lament in more than one library discussion over the years. “People aren’t coming to our library like they should,” librarians have told me. “We’ve got a rich collection, and we’ve expended lots of resources on an online presence, but lots of our patrons just go to Google and Wikipedia without checking to see what we have.” The pattern of quick online information-finding using search engines and Wikipedia is well-known enough that it has its own acronym: GWR, for Google -> Wikipedia -> References. (David White gives a good description of that pattern in the linked article.)

Some people I’ve talked to think we should break this pattern. With the right search tool or marketing plan, some say, we can get patrons to start with us first, instead of Google or Wikipedia. This idea seems to me both futile and beside the point. Between them, Google and Wikipedia cover a vast array of online information, more than librarians could hope to replicate or index ourselves in that medium. Also, if we truly have better resources available in our libraries than can be found on the open Web, it’s less important that our researchers start from our libraries’ websites than that they end up finding the knowledge resources our libraries make available to them.

Looked at the right way, Wikipedia can be a big help in making online readers aware of their library’s offerings. One of the things we spend a lot of time on in libraries is organizing information into distinct, conceptual categories. That’s what Wikipedia does too: so far, their English edition has over 4 million concepts identified, described, and often populated with reference links. And Wikipedia has encouraged people to add links to relevant digital library collections on various topics, through programs like Wikipedia Loves Libraries and Wikipedian in Residence programs. But while these programs help bring some library resources online, and direct people to those selected resources, there’s still a lot of other relevant library material that users can’t get to via Wikipedia, but can via the libraries that are near them.

So how do we get people from Wikipedia articles to the related offerings of our local libraries? Essentially we need three things: First, we need ways to embed links in Wikipedia to the libraries that readers use. (We can’t reasonably add individual links from an article to each library out there, because there are too many of them– there has to be a way that each Wikipedia reader can get to their own favored libraries via the same links.) Second, we need ways to derive appropriate library concepts and local searches from the subjects of Wikipedia articles, so the links go somewhere useful. Finally, we need good summaries of the resources a reader’s library makes available on those concepts, so the links end up showing something useful. With all of these in place, it should be possible for researchers to get from a Wikipedia article on a topic straight to a guide to their local library’s offerings on that topic in a single click.

I’ve developed some tools to enable these one-click Wikipedia -> library transitions. For the first thing we need, I’ve created a set of Wikipedia templates for adding library links. The documentation for the Library resources box template, for instance, describes how to use it to create a sidebar box with links to resources about (or by) the topic of a Wikipedia article in a reader’s library, or in another library a reader might want to consult. (There’s also an option for direct links to my Online Books Page, if there are relevant books online; it may be easier in some cases for readers to access those than to access their local library’s books.)

For the links to work, we need to know about the reader’s preferred library. Users can register their preferred library (which will set a cookie in their browser recording that choice), or select it for each individual search. We know how to link to several dozen libraries so far, and can add more libraries on request. Worldcat.org, which includes holdings of thousands of libraries worldwide, is also an option. Besides the “Library resources box” template, I’ve also provided templates for in-text links to library resources, if those work better in a given article. Links to these templates can be found at the end of the “Library resources box” documentation.

For the second thing we need, I’ve created a library forwarding service (“Forward to Libraries”, or FTL– catchier name suggestions welcome) that transforms links from Wikipedia into searches for appropriate headings or keywords in local libraries. This is the same service I describe in my “From my library to yours” blog post from last month, but it now supports links from Wikipedia as well as to Wikipedia.

Thanks to information included in the Library of Congress’ Authorities and Vocabularies datasets, OCLC’s VIAF data feeds, Wikipedia’s database downloads, and my own metadata compiled at The Online Books Page, FTL already knows how to link directly to over 240,000 distinct authority-controlled headings known to the Library of Congress from their corresponding Wikipedia articles. (Library of Congress headings are used in most sizable US libraries, and many English-language libraries outside the US also use similar headings.)

For other articles, FTL by default will try a general keyword search based on the Wikipedia article’s title, which will often turn up useful results at the destination library. Alternatively, my templates allow Wikipedia editors to determine a specific Library of Congress heading to use in library links, if appropriate. I’m hoping to incorporate suggested headings into FTL’s own knowledge base as I detect them showing up in Wikipedia articles. I also plan to publish FTL’s data sets under open access terms, so that others can use and improve on them as well.
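The lookup order described above (an editor-supplied heading first, then a known heading for the article, then a keyword search on the article title) might be sketched roughly as follows. This is an illustration only, not FTL's actual code or URL scheme: the table contents, the `examplelib` library key, and the `catalog.example.org` URL templates are hypothetical placeholders.

```python
from urllib.parse import quote_plus

# Hypothetical sample of a Wikipedia-title -> LC heading table (illustrative, not FTL's data).
WIKIPEDIA_TO_LC = {
    "Elizabeth Cady Stanton": "Stanton, Elizabeth Cady, 1815-1902",
}

# Hypothetical search URL templates for one library's catalog.
LIBRARY_SEARCH = {
    "examplelib": {
        "heading": "https://catalog.example.org/browse?heading={query}",
        "keyword": "https://catalog.example.org/search?q={query}",
    },
}

def library_link(article_title, library, editor_heading=None):
    """Build a catalog URL for a Wikipedia article at the reader's chosen library."""
    templates = LIBRARY_SEARCH[library]
    # Prefer a heading chosen by a Wikipedia editor, then a known mapping.
    heading = editor_heading or WIKIPEDIA_TO_LC.get(article_title)
    if heading is not None:
        return templates["heading"].format(query=quote_plus(heading))
    # Otherwise fall back to a keyword search on the article title.
    return templates["keyword"].format(query=quote_plus(article_title))
```

For example, `library_link("Women's history", "examplelib")` has no mapped heading in this sample table, so it falls back to a keyword search URL.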

The third part of this solution– displaying relevant resources at the destination library– can be implemented differently at each library. For most of the libraries in FTL’s current knowledge base, links go to searches in the library’s regular online catalog. But with some libraries, I’ve linked to another discovery system, if it seems to be the main search promoted at that library, and it seems to produce useful results. The Online Books Page’s subject map displays also have features that I think will be useful to Wikipedia subject researchers arriving at my site, such as also showing related subjects and books filed under those subjects. I hope in future posts to talk more about other useful guideposts and contextual information we could be providing to readers arriving from Wikipedia.

But if you’ve read this far, you probably want to see how this all works in practice. So I’ve added some example library resources boxes in a few Wikipedia articles that seemed particularly relevant this month, including those for Women’s history, Elizabeth Cady Stanton, and Flannery O’Connor. Look down in the “External links” or “Further reading” sections of those articles for the boxes, and view the page source of the articles to see how those boxes are constructed.

As with most things related to Wikipedia, this service is experimental, and subject to change (and, hopefully, improvement) over time. I’d love to hear thoughts and suggestions from users and maintainers of Wikipedia and libraries. And if you find creating these sorts of links from Wikipedia useful, and need help getting started, I’d be happy to help you bring them to your favorite Wikipedia topics and local libraries, as time permits.

October 29, 2010

Earlier this week, I participated in the Books Online workshop in Toronto. The workshop featured an update from James Crawford on Google Books, and papers on social reading and ebook interfaces, and the development of ebook services for various communities with special needs, including children, isolated First Nations communities, and the visually impaired. There were also reports from various other research projects, mostly outside the US. It was an especially good opportunity to get in touch with projects and people I don’t ordinarily encounter, since I don’t do much international travel at the moment.

My own contribution was a keynote titled “The Metadata Challenge”, discussing some ways in which metadata can be used effectively to support discovery, access, and usability in large-scale digital libraries. The talk covers a wide range of topics, including several metadata-related projects I’ve written about here, such as subject maps, copyright data, and work-oriented catalog views. I’ve posted slides and notes from the talk on my Metadata Challenge page, which you’re welcome to download and read.

This coming Tuesday, I’ll be giving a more tightly focused metadata talk at the Digital Library Federation forum in Palo Alto. I’ll be going into detail about how we use freely distributed linked authority data on subjects from the Library of Congress to improve discovery in catalogs being maintained and developed at Penn. In the same session, Kevin Ford of the Library of Congress will talk about recent changes and new improvements in the service we’re using.

I expect my presentation to include a live demo and other interactive elements, and we both hope to leave plenty of time for questions and discussion. If you’re interested in how you can use LC’s authority data, or data like it, to improve subject-oriented catalogs, I encourage you to attend the session if you can make it.

October 18, 2010

As we begin Open Access Week, it’s worth noting the importance of open access not only to research articles, data, teaching materials, and the like, but also to books. We are fortunate not only that millions of historic volumes are now openly accessible from various digitization projects, but also that many recent volumes are available as open access from a variety of academic presses, government and nonprofit agencies, and other individuals and groups.

Part of the problem is that the library community faces its own open access issues with its cataloging data. Many libraries use OCLC’s WorldCat to collaborate on cataloging books, but WorldCat is not open access, as defined by projects like the Budapest Open Access Initiative (which uses a definition that includes free reuse and redistribution of “open access” material by anyone). After months of debate (including some discussion on this blog), OCLC decided to adopt a policy that allows access and reuse of WorldCat-mediated data by OCLC members, but limits use and redistribution outside the membership.

A number of libraries and library-related organizations, however, have taken a more open approach. For instance, several German libraries and the biblios.net project make their bibliographic data available for reuse without restriction. The British Library is now making its bibliographic data generally available for non-commercial use. And the Open Knowledge Foundation has also released a draft of working principles for open bibliographic data, recommending that bibliographic data be made available with as few restrictions as possible (ideally, with public domain dedication).

Once you’ve opened your data, lots of people can reuse and adapt it in useful ways. For instance, I have harvested metadata provided without restriction by Hathi Trust on over 1 million freely readable online volumes they have in their digital collections. Today I have made it browsable and searchable on The Online Books Page. Not only does this let users search across lots of books digitized by Google, Microsoft, and various other projects large and small, but it also provides new ways of exploring the Hathi collection not previously possible, such as browsing through subject maps of the collection. (See, for instance, how you can explore various battles and campaigns of the American revolution, with both Hathi and non-Hathi titles.) My announcement on The Online Books Page has more details about the new Hathi books, and the new “extended shelves” that will eventually include additional collections as well.

The Hathi data I’m using is not as rich as full MARC catalog data would be. (I’m getting it from their OAI data export, which strips out some information from the original catalog records, and I’m currently using their Dublin Core data instead of their MARC data.) Fortunately, I can use other open data to make automated improvements to the data I get from Hathi. In particular, I’m using open subject authority data provided by the Library of Congress to automatically update many of the subject headings in the Hathi data, so that they’re compatible with present-day cataloging practice. (I describe the basic technique in a previous post.) In the future, I plan to use further data sources and automated methods to make author names and subject assignments for books more consistent and complete as well.

I hope the new extended shelves will be useful to users of both The Online Books Page and Hathi Trust’s online book collection. Others are free to reuse the same data I used to create similar, or better, book searching and browsing indexes. I’d like to thank Hathi Trust, the Library of Congress, Google, and the other digitization, preservation and copyright-clearance partners of Hathi for providing the open data that makes it possible to liberate all of these books. And I’d love to hear from readers browsing the new, extended online bookshelves.

July 31, 2010

In an earlier post, I discussed how I was using the open data from the Library of Congress’ Authorities and Vocabularies service to enhance subject browsing on The Online Books Page. More recently, I’ve used the same data to make my subjects more consistent and up to date. In this post, I’ll describe why I need to do this, and why doing it isn’t as hard as I feared that it might be.

The Library of Congress Subject Headings (LCSH) is a standard set of subject names, descriptions, and relationships, begun in 1898, and periodically updated ever since. The names of its subjects have shifted over time, particularly in recent years. For instance, recently subject terms mentioning “Cookery”, a word more common in the 1800s than now, were changed to use the word “Cooking”, a term that today’s library patrons are much more likely to use.

It’s good for local library catalogs that use LCSH to keep in sync with the most up to date version, not only to better match modern usage, but also to keep catalog records consistent with each other. Especially as libraries share their online books and associated catalog records, it’s particularly important that books on the same subject use the same, up-to-date terms. No one wants to have to search under lots of different headings, especially obsolete ones, when they’re looking for books on a particular topic.

Libraries with large, long-standing catalogs often have a hard time staying current, however. The catalog of the university library where I work, for instance, still has some books on airplanes filed under “Aeroplanes”, a term that recalls the long-gone days when open-cockpit daredevils dominated the air. With new items arriving every day to be cataloged, though, keeping millions of legacy records up to date can be seen as more trouble than it’s worth.

But your catalog doesn’t have to be big or old to fall out of sync. It happens faster than you might think. The Online Books Page currently has just over 40,000 records in its catalog, about 1% of the size of my university’s. I only started adding LC subject headings in 2006. I tried to make sure I was adding valid subject headings, and made changes when I heard about major term renamings (such as “Cookery” to “Cooking”). Still, I was startled to find out that only 4 years after I’d started, hundreds of subject headings I’d assigned were already out of date, or otherwise replaced by other standardized headings. Fortunately, I was able to find this out, and bring the records up to date, in a matter of hours, thanks to automated analysis of the open data from the Library of Congress. Furthermore, as I updated my records manually, I became confident I could automate most of the updates, making the job faster still.

Here’s how I did it. After downloading a fresh set of LC subject headings records in RDF, I ran a script over the data that compiled an index of authorized headings (the proper ones to use), alternate headings (the obsolete or otherwise discouraged headings), and lists of which authorized headings were used for which alternate headings. The RDF file currently contains about 390,000 authorized subject headings, and about 330,000 alternate headings.
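A minimal sketch of that indexing step, in Python, might look like the following. It assumes the RDF has already been parsed into (authorized heading, list of alternate headings) pairs; the parsing itself is left out, and the function name and sample records are hypothetical rather than my actual script.

```python
from collections import defaultdict

def build_heading_index(records):
    """Build a set of authorized headings and a map from each alternate to its authorized forms."""
    authorized = set()
    alternates = defaultdict(set)
    for preferred, alt_labels in records:
        authorized.add(preferred)
        for alt in alt_labels:
            # One alternate heading can point to several authorized headings.
            alternates[alt].add(preferred)
    return authorized, dict(alternates)

# Small sample standing in for the parsed RDF records.
records = [
    ("Cooking", ["Cookery"]),
    ("Airplanes", ["Aeroplanes"]),
    ("Kings and rulers", ["Royalty"]),
    ("Queens", ["Royalty"]),
]
authorized, alternates = build_heading_index(records)
```

Keeping a set of authorized targets for each alternate heading, rather than a single value, preserves the cases where one obsolete term maps to more than one current heading.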

Then I extracted all the subjects from my catalog. (I currently have about 38,000 unique subjects.) Then I had a script check each subject to see if it was listed as an authorized heading in the RDF file. If not, I checked to see if it was an alternate heading. If neither was the case, and the subject had subdivisions (e.g. “Airplanes — History”) I removed a subdivision from the end and repeated the checks until a term was found in either the authorized or alternate category, or I ran out of subdivisions.
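The checking loop can be sketched along these lines. This is an illustrative Python version under stated assumptions, not my original script: the sample `AUTHORIZED` and `ALTERNATES` tables are tiny stand-ins for the real index, and the `" -- "` subdivision separator is an assumed plain-text convention that a real catalog may write differently.

```python
# Tiny stand-ins for the index built from the LC data (sample values only).
AUTHORIZED = {"Airplanes", "Cooking"}
ALTERNATES = {"Aeroplanes": {"Airplanes"}, "Cookery": {"Cooking"}}

def classify_subject(subject, authorized=AUTHORIZED, alternates=ALTERNATES, sep=" -- "):
    """Return (status, matched_heading), dropping trailing subdivisions until a match is found."""
    parts = subject.split(sep)
    while parts:
        candidate = sep.join(parts)
        if candidate in authorized:
            return ("authorized", candidate)
        if candidate in alternates:
            return ("alternate", candidate)
        # Neither list has it: remove the last subdivision and try again.
        parts.pop()
    return ("unknown", None)
```

With the sample tables, `classify_subject("Aeroplanes -- History")` matches on the base term after one subdivision is removed and reports it as an alternate heading, which marks the subject as needing replacement.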

This turned up 286 unique subjects that needed replacement– over 3/4 of 1% of my headings, in less than 4 years. (My script originally identified even more, until I realized I had to ignore the simple geographic or personal names. Those aren’t yet in LC’s RDF file, but a few of them show up as alternate headings for other subjects.) These 286 headings (some of them the same except for subdivisions) represented 225 distinct substitutions. The bad headings were used in hundreds of bibliographic records, the most popular full heading being used 27 times. The vast majority of the full headings, though, were used in only one record.

What was I to replace these headings with? Some of the headings had multiple possibilities. “Royalty” was an alternate heading for 5 different authorized headings: “Royal houses”, “Kings and rulers”, “Queens”, “Princes” and “Princesses”. But that was the exception rather than the rule. All but 10 of my bad headings were alternates for only one authorized heading. After “Royalty”, the remaining 9 alternate headings presented a choice between two authorized forms.

When there’s only 1 authorized heading to go to, it’s pretty simple to have a script do the substitution automatically. As I verified while doing the substitutions manually, nearly all the time the automatable substitution made sense. (There were a few that didn’t: for instance, when “Mind and body — Early works to 1850” is replaced by “Mind and body — Early works to 1800”, works first published between 1800 and 1850 get misfiled. But few substitutions were problematic like this– and those involving dates, like this one, can be flagged by a clever script.)

If I were doing the update over again, I’d feel more comfortable letting a script automatically reassign, and not just identify, most of my obsolete headings. I’d still want to manually inspect changes that affect more than one or two records, to make sure I wasn’t messing up lots of records in the same way; and I’d also want to manually handle cases where more than one term could be substituted. The rest– the vast majority of the edits– could be done fully automatically. The occasional erroneous reassignment of a single record would be more than made up for by the repair of many more obsolete and erroneous old records. (And if my script logs changes properly, I can roll back problematic ones later on if need be.)
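One way such a script might decide what to do with each obsolete heading is sketched below in Python. It is a hypothetical illustration, not code I have run on my catalog: the `ALTERNATES` sample, the function names, the `" -- "` separator, and the year-comparison heuristic for flagging date changes are all assumptions.

```python
import re

# Sample alternate-heading index, using cases discussed above (illustrative values).
ALTERNATES = {
    "Cookery": {"Cooking"},
    "Royalty": {"Royal houses", "Kings and rulers", "Queens", "Princes", "Princesses"},
    "Mind and body -- Early works to 1850": {"Mind and body -- Early works to 1800"},
}

def years_in(text):
    """Return the four-digit numbers that appear in a heading."""
    return re.findall(r"\b\d{4}\b", text)

def plan_replacement(subject, matched, alternates=ALTERNATES):
    """Return (action, new_subject) for a subject whose leading part `matched` is obsolete."""
    targets = sorted(alternates[matched])
    if len(targets) != 1:
        return ("manual", None)  # several possible replacements: leave to a person
    new_subject = targets[0] + subject[len(matched):]  # keep any trailing subdivisions
    if years_in(matched) != years_in(targets[0]):
        return ("review", new_subject)  # the dates changed, so records could be misfiled
    return ("auto", new_subject)

# Log each automatic change as (old, new) so it can be rolled back later.
change_log = []
for subject, matched in [
    ("Cookery -- History", "Cookery"),
    ("Royalty", "Royalty"),
    ("Mind and body -- Early works to 1850", "Mind and body -- Early works to 1850"),
]:
    action, new_subject = plan_replacement(subject, matched)
    if action == "auto":
        change_log.append((subject, new_subject))
```

Only the first of the three sample subjects is changed automatically here; the other two are routed to a person, which matches the manual handling described above. A fuller version would also count how many records each change touches before applying it.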

Mind you, now that I’ve brought my headings up to date once, I expect that further updates will be quicker anyway. The Library of Congress releases new LCSH RDF files about every 1-2 months. There should be many fewer changes in most such incremental updates than there would be when doing years’ worth of updates all at once.

Looking at the evolution of the Library of Congress catalog over time, I suspect that they do a lot of this sort of automatic updating already. But many other libraries don’t, or don’t do it thoroughly or systematically. With frequent downloads of updated LCSH data, and good automated procedures, I suspect that many more could. I have plans to analyze some significantly larger, older, and more diverse collections of records to find out whether my suspicions are justified, and hope to report on my results in a future post. For now, I’d like to thank the Library of Congress once again for publishing the open data that makes these sorts of catalog investigations and improvements feasible.

Say you’d like to read some books about logic, for instance. You’d rather not have to go find and troll all the appropriate shelf sections within math, philosophy, psychology, computing, and wherever else logic books might be found in a physical library. And you’d rather not have to think of all the different keywords used to identify different logic-related topics in a typical online catalog. In my subject map for logic, you can see lots of suggestions of books filed both under “Logic” itself, and under related concepts. You can go straight to a book that looks interesting, select a related subject and explore that further, or select the “i” icon next to a particular book to find more books like it.

As I’ve noted previously, the relationships and explanations that enable this sort of exploration depend on a lot of data, which has to come from somewhere. In previous versions of my catalog, most of it came from a somewhat incomplete and not-fully-up-to-date set of authority records in our local catalog at Penn. But the Library of Congress (LC) has recently made authoritative subject cataloging data freely available on a new website. There, you can query it through standard interfaces, or simply download it all for analysis.

I recently downloaded their full data set (38 MB of zipped RDF), processed it, and used it to build new subject maps for The Online Books Page. The resulting maps are substantially richer than what I had before. My collection is fairly small by the standards of mass digitization– just shy of 40,000 items– but still, the new data, after processing, yielded over 20,000 new subject relationships, and over 600 new notes and explanations, for the subjects represented in the collection.

That’s particularly impressive when you consider that, in some ways, the RDF data is cruder than what I used before. The RDF schemas that LC uses omit many of the details and structural cues that are in the MARC subject authority records at the Library of Congress (and at Penn). And LC’s RDF file is also missing many subjects that I use in my catalog; in particular, at present it omits many records for geographic, personal, and organizational names.

Even so, I lost few relationships that were in my prior maps, and I gained many more. There were two reasons for this: First of all, LC’s file includes a lot of data records (many times more than my previous data source), and they’re more recent as well. Second, a variety of automated inference rules– lexical, structural, geographic, and bibliographic– let me create additional links between concepts with little or no explicit authority data. So even though LC’s RDF file includes no record for Ontario, for instance, its subject map in my collection still covers a lot of ground.

A few important things make these subject maps possible, and will help them get better in the future:

A large, shared, open knowledge base: The Library of Congress Subject Headings have been built up by dedicated librarians at many institutions over more than a century. As a shared, evolving resource, the data set supports unified searching and browsing over numerous collections, including mine. The work of keeping it up to date, and in sync with the terms that patrons use to search, can potentially be spread out among many participants. As an open resource, the data set can be put to a variety of uses that both increase the value of our libraries and encourage the further development of the knowledge base.

Making the most of automation: LC’s website and standards make it easy for me to download and process their data automatically. Once I’ve loaded their data, and my own records, I then invoke a set of automated rules to infer additional subject relationships. None of the rules is especially complex; but put together, they do a lot to enhance the subject maps. Since the underlying data is open, anyone else is also free to develop new rules or analyses (or adapt mine, once I release them). If a community of analyzers develops, we can learn from each other as we go. And perhaps some of the relationships we infer through automation can be incorporated directly into later revisions of LC’s own subject data.

Judicious use of special-purpose data: It is sometimes useful to add to or change data obtained from external sources. For example, I maintain a small supplementary data file on major geographic areas. A single data record saying that Ontario is a region within Canada, and is abbreviated “Ont.”, generates much of my subject map for Ontario. Soon, I should also be able to re-incorporate local subject records, as well as arbitrary additional overlays, to fill in conceptual gaps in LC’s file. Since local customizations can take a lot of effort to maintain, however, it’s best to try to incorporate local data into shared knowledge bases when feasible. That way, others can benefit from, and add on to, your own work.
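To make the idea of a simple automated rule concrete, here is a minimal Python sketch of one structural rule: linking a subdivided heading to the nearest broader heading that actually exists in the collection. It is an illustration under assumptions, not the code behind these maps. The function names are hypothetical, and the " -- " separator assumes headings are written with subdivisions joined that way.

```python
def broader_headings(heading, sep=" -- "):
    """Return the broader headings implied by a subdivided heading,
    from most specific to most general (hypothetical helper)."""
    parts = heading.split(sep)
    return [sep.join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


def infer_links(headings, sep=" -- "):
    """Link each heading to its nearest broader heading that is also
    present in the given set of headings."""
    known = set(headings)
    links = set()
    for heading in known:
        for candidate in broader_headings(heading, sep):
            if candidate in known:
                links.add((heading, candidate))
                break  # stop at the nearest known ancestor
    return links


if __name__ == "__main__":
    sample = ["Ontario", "Ontario -- History -- 19th century"]
    for narrower, broader in sorted(infer_links(sample)):
        print(f"{narrower}  ->  {broader}")
```

No single rule like this does much on its own; the point is that several small rules, each this simple, can be run together over the whole data set.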

Recently, there’s been a fair bit of debate about whether to treat cataloging data as an open public good, or to keep it more restricted. The Library of Congress’ catalog data has been publicly accessible online for years, though until recently you could only get a little at a time via manual searches, or pay a large sum to get a one-time data dump. By creating APIs, using standard semantic XML formats, and providing free, unrestricted data downloads for their subject authority data, LC has made their data much easier for others to use in a variety of ways. It’s improved my online book catalog significantly, and can also improve many other catalogs and discovery applications. Those of us who use this data, in turn, have incentives to work to improve and sustain it.

Making the LC Subject Headings ontology open data makes it both more useful and more viable as libraries evolve. I thank the folks at the Library of Congress for their openness with their data, and I hope to do my part in improving and contributing to their work as well.

January 15, 2010

I’ve now made a few posts about concept-oriented catalogs, describing the basic idea, showing some examples, and talking about the kinds of context they should provide for users. As I mentioned in my first post, concepts in such catalogs are “first-class locuses of information to help readers find useful knowledge resources”. The catalogs I’m describing include a variety of concepts (beyond the bibliographic record) that have data associated with them, and this data gives users a helpful context for finding appropriate knowledge resources.

As I said in my example post, “The concepts come from, and are maintained by, various groups of people…. [They] may be derived in part from existing MARC bibliographic metadata (sometimes through automated analysis), but often draw from additional data sources.”

If you’ve worked in cataloging lately, you might be thinking, “that’s nice, but we’ve got our hands full just providing MARC catalog records for all the books and other stuff coming through the door now. Where’s all this other ‘concept’ data going to come from? And how will it be practical to use and maintain?”

In this post, I’d like to take a stab at answering those questions. I’ll draw a lot from my experience with subject maps, but a lot of what I say should apply to other kinds of concept data as well.

The conceptual data behind subject maps consists of annotations on different subjects, and links between related subjects. A lot of what I need to build these maps can simply be reused from existing data. In particular, the Library of Congress Subject Headings system (LCSH) provides a large set of subjects with standardized names. We also have a set of authority records associated with those subjects that give alternate names, notes, and links to related subjects.

To make it practical to build a subject map for this data, I bulk-loaded authority records from our local catalog. While the Library of Congress Authorities are more up to date than our local catalog, I could only look up records there one at a time, through an interface designed for manual browsing. Fortunately, since then the Library of Congress has provided ways to download subject authority data in bulk. It’s in a format that omits some details, but it should still help fill out our maps when we start including these records as well. Because our library and the Library of Congress are both using a common system of identifiers for subjects, as well as compatible formats for expressing subject relationships, I’ll be able to combine our authority information with theirs to provide useful maps. The identifiers we use are not always in sync; LCSH subject terms do get renamed and discontinued from time to time. But the cross-references in LCSH authority records, which often include the old terms as aliases of the new terms, help reduce the pain involved in moving from old terms to newer terms.
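The way cross-references ease the move from old terms to new ones can be sketched in a few lines of Python. This is only an illustration: the record layout (a dict with "heading" and "aliases" keys) and the function names are assumptions, and the sample record is made up for the example rather than copied from a real authority file.

```python
def build_alias_index(authority_records):
    """Map every current heading and every alias to a current heading.

    Each record is assumed to look like:
        {"heading": "Airplanes", "aliases": ["Aeroplanes"]}
    """
    index = {}
    # First pass: aliases point at the heading that lists them.
    for record in authority_records:
        for alias in record.get("aliases", []):
            index.setdefault(alias, record["heading"])
    # Second pass: a current heading always resolves to itself,
    # even if some other record also lists it as an alias.
    for record in authority_records:
        index[record["heading"]] = record["heading"]
    return index


def resolve(term, index):
    """Return the current heading for a term, or the term unchanged
    if no authority record mentions it."""
    return index.get(term, term)


if __name__ == "__main__":
    records = [{"heading": "Airplanes", "aliases": ["Aeroplanes"]}]  # illustrative data
    index = build_alias_index(records)
    print(resolve("Aeroplanes", index))
```

Terms that no record mentions pass through unchanged, so bibliographic records using headings outside the authority file are not lost.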

Subject maps built just on authority records turn out to be pretty generic, and not as useful as they could be. To make them more useful, we need more data. As I describe in more detail in this white paper, I also analyze our bibliographic corpus to see what subject terms we actually use in our catalog, look at the structure of those terms (which are often coordinated from multiple components), and also look at correlations between terms that get used together in the same bibliographic records. This analysis lets me create additional useful relationships between subjects. In short, I use automated analysis of a large data corpus to create new concept data from existing data.
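As a rough illustration of the correlation step, the following Python sketch counts how often two subject terms appear together in the same bibliographic record, and keeps the pairs that co-occur at least a minimum number of times. The input shape (one list of subject strings per record), the function names, and the threshold are assumptions made for this example; the analysis described in the white paper may differ.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(records):
    """Count how many records each unordered pair of subjects shares.

    `records` is assumed to be an iterable of lists of subject strings,
    one list per bibliographic record.
    """
    counts = Counter()
    for subjects in records:
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(subjects)), 2):
            counts[pair] += 1
    return counts


def related_pairs(records, min_count=2):
    """Return subject pairs that co-occur in at least `min_count` records,
    most frequent first."""
    counts = cooccurrence_counts(records)
    return [pair for pair, n in counts.most_common() if n >= min_count]
```

Raw counts favor very common subjects, so a real analysis would probably want some normalization; this sketch leaves that out.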

In order to link together the many subjects that have geographic aspects, I need some extra data that isn’t in authority records. Once I created a data record that noted that “Pennsylvania” is a US state that gets abbreviated “Pa.” in some subject headings, I was able to build all kinds of relationships between “Philadelphia (Pa.)” and related subjects, none of which are directly stated in the authority records for these subjects, but all of which can be derived by automated analysis. (It helps that subject terms in LCSH have a fairly well-defined structure that’s amenable to lexical analysis.) A couple hundred other brief geographic data records are enough to let users zoom in and out of locations all over the globe. So a small amount of well-designed and curated supplementary data can often enhance lots of concepts, with minimal maintenance cost.
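Here is a minimal Python sketch of how one small geographic record can drive that kind of lexical analysis. The record format, the regular expression, and the function names are assumptions for illustration only. The single record encodes just what the paragraph above states: Pennsylvania is a US state abbreviated “Pa.” in some headings.

```python
import re

# One supplementary record, keyed by the abbreviation used in qualifiers.
REGIONS = {
    "Pa.": {"name": "Pennsylvania", "within": "United States"},
}

# Matches headings of the form "Place (Qualifier)" with optional trailing
# subdivisions, e.g. "Philadelphia (Pa.)" or "Philadelphia (Pa.) -- History".
QUALIFIED = re.compile(r"^(?P<place>.+?) \((?P<abbr>[^()]+)\)(?P<rest>.*)$")


def containing_region(heading, regions=REGIONS):
    """Return the name of the region a qualified heading falls within,
    or None if the heading has no recognized geographic qualifier."""
    match = QUALIFIED.match(heading)
    if not match:
        return None
    region = regions.get(match.group("abbr"))
    return region["name"] if region else None


def zoom_out(heading, regions=REGIONS):
    """Return the chain of progressively broader places for a heading."""
    chain = [heading]
    match = QUALIFIED.match(heading)
    if match and match.group("abbr") in regions:
        region = regions[match.group("abbr")]
        chain.extend([region["name"], region["within"]])
    return chain
```

Qualifiers that are not in the table, such as a parenthetical that disambiguates rather than locates, are simply ignored, which is why a short, curated table goes a long way.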

While I can easily zoom in and out between the US, Pennsylvania, Philadelphia, and locations within Philadelphia, I’d need more data to move side to side. I don’t have any data, for instance, that tells me that Philadelphia is right next to Camden, New Jersey. But fortunately, I can mine external data sources to find this information. I recently read about a source of public domain global map data, for instance, that I (or any other geographic-concept catalog builder) could use to link subjects or other resources to a world map.

Increasing amounts of public data are distributed online. If the data is public domain, or available with a liberal license, I don’t have to worry about legal roadblocks to downloading it, analyzing it, and using it in my own work. Sharing data helps everyone build not only smarter catalogs, but smarter applications of all kinds.

Data sharing does not always happen painlessly. I may have different concepts, or different names for concepts, than someone else whose data I might find useful. We may have different ideas about how to structure our data. But there are now systems that provide links between different names, and crosswalks between different structures that can help bridge the gap between my data and that of others.

With large enough corpuses of data to draw on, I can even make use of unstructured information from large groups of ordinary users. For example, LibraryThing’s tag cloud displays a number of terms that are useful to include in one’s own library catalog. Not all of them are formally defined subjects, but they’re used enough that we should expect most of them to be used in patron searches. It should be possible to analyze the cloud and the things tagged in the cloud to associate many informal terms with particular subjects or library resources.

To summarize, it becomes much easier to derive the data needed for concept-oriented catalogs if:

We have stable (or at least smoothly evolving) identifiers for concepts

We can use, swipe, and reuse a large domain of [meta]data for concept analysis (including automated analysis)

We carefully consider what additional concept data would enhance our services, and use standard, recognized forms to represent it

We have correspondences and crosswalks between different concept identifiers and formats

We share our concept data (and bibliographic data in general) as openly and broadly as possible

And we share information, expertise, and code that supports the innovative, useful catalogs we build.

There’s a non-trivial technical infrastructure implied by these requirements. But it’s one that we can build. (Quite a bit of it’s in place already.) A lot of it depends on a healthy social infrastructure to create, maintain, share, and work with all the data and services that we create and adopt. I hope to talk more about this social infrastructure in future posts.

“How should we offer searching in library collections?” is a question that lots of libraries are asking. The answer heard a lot nowadays is “Facets!” Facets have been used in databases and e-commerce sites for some years now. Essentially, they define several (ideally independent) attributes for items, and then let users zero in on what they want by selecting and deselecting various attributes. For example, if you go to Amazon to buy shoes, you can select values from facets like brand, size, color, and price range. Try different selections, and you can quickly pick out the few pairs that best meet your needs out of the tens of thousands offered on the site. (Assuming you’re willing to buy shoes without trying them on.)
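The mechanics of faceted selection are simple enough to sketch in a few lines of Python. This is a generic illustration, not how any particular site or catalog implements it; the item layout (plain dicts) and the function names are assumptions.

```python
from collections import Counter


def filter_items(items, selections):
    """Keep only the items whose attributes match every selected facet value.

    `selections` maps a facet name to the value the user picked,
    e.g. {"size": 9, "color": "black"}. An empty selection keeps everything.
    """
    return [
        item for item in items
        if all(item.get(facet) == value for facet, value in selections.items())
    ]


def facet_counts(items, facet):
    """Count how many of the given items carry each value of one facet,
    which is what a faceted display shows next to each choice."""
    return Counter(item[facet] for item in items if facet in item)


if __name__ == "__main__":
    shoes = [  # made-up sample data
        {"brand": "Acme", "size": 9, "color": "black"},
        {"brand": "Acme", "size": 10, "color": "brown"},
        {"brand": "Zenith", "size": 9, "color": "black"},
    ]
    remaining = filter_items(shoes, {"size": 9, "color": "black"})
    print(facet_counts(remaining, "brand"))
```

Because each facet is treated as an independent attribute, selecting and deselecting values is just a matter of re-running the filter with a different set of selections.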

The Endeca catalog at NC State applies the same idea to finding books in the library. When it came out two years ago, lots of library folks got excited. And when open source tools like Solr made it easy to code up your own faceted catalog, it came as no surprise that lots of folks set out to try facet-based discovery for their collections. These new catalogs are in many ways big improvements over existing catalogs. Though, as K. G. Schneider and others point out, that’s not a high bar to clear.

Some have said that subject headings should change to be more facet-oriented. That’s the recommendation of the Calhoun Report, commissioned by the Library of Congress and released in 2006, which called for dismantling the Library of Congress Subject Headings (LCSH), now the most common subject headings vocabulary. The more recent report from the Future of Bibliographic Control doesn’t go that far, but it does recommend transforming LCSH, “de-coupling subject strings” and evaluating LCSH’s ability to “support faceted browsing and discovery”. The FAST system, which breaks up subjects into uncoordinated facets, is mentioned as an interesting technology to pursue.

LCSH indeed has several problems associated with it: people have a hard time finding the appropriate subject terms for what they’re looking for; catalogers have a hard time constructing terms that follow all the LCSH rules; terms are used inconsistently across collections; terms are slow to adapt to contemporary usage; and both “traditional” and faceted library catalogs have a hard time connecting related terms together using LCSH.

Should we, then, dismantle LCSH into a simple system of facet sets? Not so fast, I say. Subjects are inherently messy things, neither fully discrete nor hierarchical, and in a large collection it’s important to be able to zero in on specific subjects through relationships. Not only is there a large installed base of materials already described with LCSH, but LCSH and ontologies like it allow books to be described with greater precision, and with richer relationships, than pure facets allow. (See Thomas Mann’s “The Peloponnesian War and the Future of Reference, Cataloging, and Scholarship in Research Libraries” for a spirited argument for the power of LCSH-style subject headings.)

What we really need are better tools that allow readers and catalogers to take full advantage of rich subject headings and relationships, and make it easier for subject headings systems to evolve more quickly to meet the needs of users. A technology I’m experimenting with now, and calling subject maps, involves networks of related subjects, techniques for enriching those networks through automation and user input, and displays that let users and librarians browse large collections by navigating through complex subject areas. Subject maps can play well with facets and user-assigned tags, to produce discovery systems that offer the best features of all of these technologies.

Too good to be true? If you want to hear more, see a demo, or ask how this would actually work, come see and/or heckle me on Saturday at ALA. I’ll be presenting at the Catalog Form and Function Interest Group, at 10:30 AM in the Versailles Room of the Sofitel Philadelphia. For more info, and for other ALA forums that may be of interest to metadata librarians, see this post on the ALA blog.