Background

Like a lot of librarians, I have access to a lot of data, and sometimes no idea how to analyze it. When I learned about linked data and the ability to search against data sources with a piece of software called OpenRefine, I wondered if it would be possible to match our users’ discovery layer queries against the Library of Congress Subject Headings. From there I could use the linking in LCSH to find the Library of Congress Classification, and then get an overall picture of the subjects our users were searching for. As with many research projects, it didn’t really turn out like I anticipated, but it did open further areas of research.

At California State University, Fullerton, we use an open source application called Xerxes, developed by David Walker at the CSU Chancellor’s Office, in combination with the Summon API. Xerxes acts as an interface for any number of search tools, including Solr, federated search engines, and most of the major discovery service vendors. We call it the Basic Search, and it’s incredibly popular with students, with over 100,000 searches a month and growing. It’s also well-liked – in a survey, about 90% of users said they found what they were looking for. We have monthly files of our users’ queries, so I had all of the data I needed to go exploring with OpenRefine.

OpenRefine

OpenRefine is an open source tool that deals with data in a very different way than typical spreadsheets. It has been mentioned in TechConnect before, and Margaret Heller’s post, “A Librarian’s Guide to OpenRefine” provides an excellent summary and introduction. More resources are also available on GitHub.

One of the most powerful things OpenRefine does is to allow queries against open data sets through a function called reconciliation. In the open data world, reconciliation refers to matching the same concept among different data sets, although in this case we are matching unknown entities against “a well-known set of reference identifiers” (Re-using Cool URIs: Entity Reconciliation Against LOD Hubs).

Reconciling Against LCSH

In this case, we’re reconciling our discovery layer search queries with LCSH. This basically means it’s trying to match the entire user query (e.g. “artist” or “cost of assisted suicide”) against what’s included in the LCSH linked open data. According to the LCSH website this includes “all Library of Congress Subject Headings, free-floating subdivisions (topical and form), Genre/Form headings, Children’s (AC) headings, and validation strings* for which authority records have been created. The content includes a few name headings (personal and corporate), such as William Shakespeare, Jesus Christ, and Harvard University, and geographic headings that are added to LCSH as they are needed to establish subdivisions, provide a pattern for subdivision practice, or provide reference structure for other terms.”

The directions at Free Your Metadata got me started. One note: the steps below apply to OpenRefine 2.5 and version 0.8 of the RDF extension. OpenRefine 2.6 requires version 0.9 of the RDF extension. Or you could use LODRefine, which bundles some major extensions; I hear it is great, but I haven’t tried it personally. The basic process shouldn’t change too much.

(1) Import your data

OpenRefine has quite a few file type options, so your format is likely already supported.

(2) Clean your data

In my case, this involves deduplicating by timestamp and removing leading and trailing whitespace. You can also remove weird punctuation, numbers, and even extremely short queries (<2 characters).

(3) Add the RDF extension

If you’ve done it correctly, you should see an RDF dropdown next to Freebase.

(4) Decide which data you’d like to search on.

In this example, I’ve decided to use just queries that are four words or fewer, and I’ve removed duplicate search queries. (Xerxes handles facet clicks as if they were separate searches, so there are many duplicates. I usually don’t remove duplicates, though, unless they happen at nearly the same time.) I’ve also experimented with limiting to 10 or 15 characters, but there were not many more matches with 15 characters than 10, even though the data set was much larger. It depends on how much computing time you want to spend…it’s really a personal choice. In this case, I chose 4 words because of my experience with 15 characters – longer does not necessarily translate into more matches. A cursory glance at LCSH left me with the impression that the vast majority of headings (not including subdivisions, since they’d be searched individually) were 4 words or less. This, of course, means that your data with more than 4 words is unusable – more on that later.

(5) Go!

(6) Now you have your queries that were reconciled against LCSH, so you can limit to just those.
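For anyone who would rather script the preparation than click through facets, steps (2) and (4) can be roughly approximated in Python. This is only a sketch and not the OpenRefine workflow described above: the (timestamp, query) row layout, the function name, the case-insensitive duplicate check, and the sample rows are all assumptions made for illustration.

```python
def prepare_queries(rows, max_words=4, min_chars=2):
    """Clean (timestamp, query) rows and keep short, unique queries.

    The row layout is an assumption; adapt it to your own log format.
    """
    seen_timestamps = set()
    seen_queries = set()
    kept = []
    for timestamp, query in rows:
        query = query.strip()                # remove leading/trailing whitespace
        if len(query) < min_chars:           # drop extremely short queries
            continue
        if timestamp in seen_timestamps:     # deduplicate by timestamp
            continue
        seen_timestamps.add(timestamp)
        if len(query.split()) > max_words:   # keep queries of four words or fewer
            continue
        key = query.lower()                  # assumption: ignore case when comparing
        if key in seen_queries:              # remove duplicate search queries
            continue
        seen_queries.add(key)
        kept.append(query)
    return kept


# Invented sample rows, for illustration only.
sample_rows = [
    ("2014-03-01 10:00:01", "  artist "),
    ("2014-03-01 10:00:01", "artist"),
    ("2014-03-01 10:05:12", "cost of assisted suicide"),
    ("2014-03-01 10:07:45", "effects of social media on teenage self esteem"),
    ("2014-03-01 10:09:30", "Artist"),
    ("2014-03-01 10:11:02", "a"),
]
print(prepare_queries(sample_rows))
```

Raising max_words lets longer queries through, at the cost of more reconciliation time for what may be few extra matches.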

Finding LC Classification

First, you’ll need to extract the cell.recon.match.id – the ID for the matched query that in the case of LCSH is the URI of the concept.

At this point you can choose whether to grab the HTML or the JSON, and create a new column based on this one by fetching URLs. I’ve never been able to get the parseJson() function to work correctly with LC’s JSON outputs, so for both HTML and JSON I’ve just regexed the raw output to isolate the classification. For more on regex see Bohyun Kim’s previous TechConnect post, “Fear No Longer Regular Expressions.”

On the raw HTML, the easiest way to do it is to transform the cells or create a new column with:

You’ll note this will only pull out the first classification given, even if some have multiple classifications. That was a conscious choice for me, but obviously your needs may vary.
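The same "first classification only" idea can be sketched in Python. To be clear, this is not the OpenRefine expression used in this project, and the sample text is an invented fragment rather than LC's actual markup; the "LC Classification" label and the pattern for a class number are assumptions that would need checking against the real output.

```python
import re

# Hypothetical fragment standing in for a fetched page; not LC's real markup.
SAMPLE_PAGE = """
<h3>LC Classification</h3>
<ul><li>PN2061</li><li>PN2080</li></ul>
"""

# One to three capital letters followed by digits, with an optional decimal part,
# searched for only after the (assumed) "LC Classification" label.
CLASSIFICATION = re.compile(
    r"LC Classification.*?\b([A-Z]{1,3}\d+(?:\.\d+)?)\b",
    re.DOTALL,
)


def first_classification(raw_text):
    """Return the first classification after the label, or None if absent."""
    match = CLASSIFICATION.search(raw_text)
    return match.group(1) if match else None


print(first_classification(SAMPLE_PAGE))
```

Using re.findall with a pattern for the class number alone would collect every classification instead of just the first.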

(Also, although I’m only concentrating on classification for this project, there’s a huge amount of data that you could work with – you can see an example URI for Acting to see all of the different fields).

Once you have the classifications, you can export to Excel and create a pivot table to count the instances of each, and you get a pretty table.
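If Excel isn't handy, the counting that the pivot table does can be sketched in a few lines of Python. The list of classifications below is invented sample data, and the function name is just an illustration.

```python
from collections import Counter


def count_classifications(classifications):
    """Count matched classifications, skipping rows where nothing was found."""
    return Counter(value for value in classifications if value)


# Invented sample values; None stands for a query with no classification.
sample = ["BF76.5", "RT41", "BF76.5", None, "PN2061", "RT41", "BF76.5"]

for classification, total in count_classifications(sample).most_common():
    print(classification, total)
```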

Caveats & Further Explorations

As you can guess by the y-axis in the table above, the number of matches is a very small percentage of actual searches. First I limited to keyword searches (as opposed to title/subject), then of those only ones that were 4 or fewer words long (about 65% of keyword searches). Of those, only about 1000 of the 26000 queries matched, and resulted in about 360 actual LC Classifications. Most months I average around 500, but in this example I took out duplicates even if they were far apart in time, just to experiment.

One thing I haven’t done but am considering is allowing matches that aren’t 100%. From my example above, there are another 600 or so queries that matched at 50-99%. This could significantly increase the number of matches and thus give us more classifications to work with.

Some of this is related to the types of searches that students are doing (see Michael J DeMars’ and my presentation “Making Data Less Daunting” at Electronic Resources & Libraries 2014, which this article grew out of, for some crazy examples) and some to the way that LCSH is structured. I chose LCSH because I could get linked to the LC Classification and thus get a sense of the subjects, but I’m definitely open to ideas. If you know of a better linked data source, I’m all ears.

I must also note that this is a pretty inefficient way of matching against LCSH. If you know of a way I could download the entire set, I’m interested in investigating that way as well.

Another approach that I will explore is moving away from reconciliation with LCSH (which is really more appropriate for a controlled vocabulary) to named-entity extraction, which takes natural language inputs and tries to recognize or extract common concepts (name, place, etc). Here I would use it as a first step before trying to match against LCSH. Free Your Metadata has a new named-entity extraction extension for OpenRefine, so I’ll definitely explore that option.

Planned Research

In the end, although this is interesting, does it actually mean anything? My next step with this dataset is to take a subset of the search queries and assign classification numbers. Over the course of several months, I hope to see if what I’ve pulled in automatically resembles the hand-classified data, and then draw conclusions.

So far, most of the peaks are expected – psychology and nursing are quite strong departments. There are some surprises though – education has been consistently underrepresented, based on both our enrollment numbers and when you do word counts (see our presentation for one month’s top word counts). Education students have a robust information literacy program. Does this mean that education students do complex searches that don’t match LCSH? Do they mostly use subject databases? Once again, an area for future research, should these automatic results match the classifications I do by hand.

What do you think? I’d love to hear your feedback or suggestions.

About Our Guest Author

Jaclyn Bedoya has lived and worked on three continents, although currently she’s an ER Librarian at CSU Fullerton. It turns out that growing up in Southern California spoils you, and she’s happiest being back where there are 300 days of sunshine a year. Also Disneyland. Reach her @spamgirl on Twitter or jaclynbedoya@gmail.com

What are libraries doing (or not doing) about linked data? This was the question that the W3C Library Linked Data Incubator Group investigated between May 2010 and August 2011. In this post, I will take a look at the final report of the W3C Library Linked Data Incubator Group (October 2011) and provide an overview of their recommendations and my own analysis of the issues. Incubator Groups were a program that the W3C ran from 2006-2012 to get work done quickly on innovative ideas that were not yet developed enough to begin creating the web standards for which the W3C exists. (The Incubator Group program has transitioned into Community and Business Groups).

In this report, the participants in the group made several key recommendations aimed at library leaders, library standards bodies, data and systems designers, and librarians and archivists. The recommendations indicate just how far we are from really being able to implement open linked data in every library but also reveal the current landscape.

Library Leaders

An illustration of the VIAF authority file for Jane Austen

The report calls on library leaders to identify potentially very useful sets of data that can be exposed easily using current practices. That is, they should not try to revolutionize workflows, but rather evolve towards more linked data. The report mentions authority files as an example of a data set that is ideal for this purpose, since authority files are lists of real world people with attributes that connect to real things. Having some semantic context for authority files helps–we could imagine a scenario in which you are searching for a common name, but the system recognizes that you are searching for a twentieth century American author and so does not show you a sixteenth century British author. Catalogers don’t necessarily have to do anything differently, either, since these authority files can link to other data to make a whole picture. VIAF (Virtual International Authority File) is a project between OCLC and several national libraries to create such a linked international authority file using linked data and enter into the semantic web.

Library leadership must face the issue of rights in an open data world. It is a trope that libraries hold much valuable cultural and bibliographic data. Yet in many cases we have purchased or leased this data from a vendor rather than creating it ourselves (certainly we do in the case of indexes and often with catalog records)–and the license terms may not allow for open sharing of the data. We must be aware that exposing linked data openly is probably not going to mesh well with the way we have done things traditionally. Harvard recently released 12 million bibliographic records under a CC0 (public domain) license. Many libraries might not be in the position to release their own bibliographic records if they did not create them originally. Of course the same goes for indexes or bibliographies, other categories of traditional library materials that seem ripe for linking semantically. Library leadership will have to address this before open linked data is truly possible.

Library Standards Bodies

The report calls on library standards bodies to attack the problem from both sides. First, librarians need to be involved with standardizing semantic web technologies in a way that meets their needs and ensures that the library world stays in line with the way the technology is moving generally. Second, creators of library data standards need to ensure that those standards are compatible with semantic web technologies. Library data, when encoded in MARC, combines meaning and the structure in one unit. This works well for people who are reading the data, but is not easy for computers to parse semantically. For instance, consider:

245 10|aPride and prejudice /|cby Jane Austen.
which, viewed in the browser or on a catalog card, looks like:

Pride and prejudice / by Jane Austen.

The 245 tells us that this is a main title, and then the 1 tells us that the title gets an added entry. The 0 tells us that the title doesn’t begin with an article, or “nonfiling character”. The |a gives the actual title, followed by a / character, and then the |c is the statement of responsibility, followed by a period. Note that there is semantic meaning mixed together with punctuation and words that are helpful for people, such as “by”, which follow the rules of AACR2. There are good reasons for these rules, but the rules were meant to serve the information needs of humans. Given the capabilities of computers to parse and present structured data meaningfully to humans, it seems vital to make library data understandable to computers so that we can use it to make something more useful to people.

You may have noticed that HTML has changed over the past few years in the same way that library data will have to change. If you, for instance, want to give emphasis to a word, you use the <em></em> tags. People know the word is emphasized because it’s in italics; the computer knows it’s emphasized because you told it that it was. Indicating that a word should be italicized using the <i></i> tags looks the same to a human reader who can understand the context for the use of italics, but doesn’t tell the computer that the word is particularly important. HTML5 has even more use of semantic tags to make more of the standard ways of presenting information on the web meaningful to computers.
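Returning to the MARC example, a short Python sketch shows how much unpicking a program has to do before it reaches the meaning. It handles only the display-style line shown above, with | marking subfields; the function names are hypothetical, and real MARC records would be better read with a dedicated library.

```python
def parse_marc_field(line):
    """Split a display-style MARC line into tag, indicators, and subfields."""
    header, _, body = line.partition("|")      # "245 10" before the first subfield
    tag, indicators = header.split()
    # Each chunk starts with a one-letter subfield code, then the value.
    subfields = [(chunk[0], chunk[1:]) for chunk in body.split("|") if chunk]
    return {
        "tag": tag,
        "indicator1": indicators[0],
        "indicator2": indicators[1],
        "subfields": subfields,
    }


def title_and_responsibility(field):
    """Strip the trailing display punctuation from subfields a and c."""
    values = dict(field["subfields"])
    title = values.get("a", "").rstrip(" /")
    responsibility = values.get("c", "").rstrip(". ")
    return title, responsibility


field = parse_marc_field("245 10|aPride and prejudice /|cby Jane Austen.")
print(field["tag"], field["indicator1"], field["indicator2"])
print(title_and_responsibility(field))
```

Even after this, the author's name is still buried behind the word "by", and stripping the final period would damage a name ending in an initial, which is exactly the kind of human-oriented convention that makes the data hard for computers to use.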

Systems Designers

The recommendations for data and systems designers are to start building tools that use linked data. Without a “killer app”, it’s hard to get excited about semantic technologies. Just after my last post went up, Google released its “Knowledge Graph”. This search takes words that traditionally would be matched as words, and matches them with “things.” For instance, if I type the search string Lincoln Hall into Google, Google guesses that I probably mean a concert venue in Chicago with that name and shows me that as the first result. It also displays a map, transit directions, reviews, and an upcoming schedule on the sidebar–certainly very convenient if that’s what I was looking for. But below the results for the concert venue, I get a box stating “See results about Lincoln Hall, Climber.” When I click on this, my results change to information about the Australian climber who recently died, and the side bar changes to information about him. Now as a librarian, I know that there would have been many ways to improve my search. But semantic web technologies allow Google’s algorithms to understand that, despite having the same name, a concert venue and a mountaineer are very different entities. This neatly disposes of the need for sophisticated searching for facts about things. Whether this is, indeed, revolutionary remains to be seen. But try it as a user. You might be pleasantly surprised by how it makes your search easier. It may be that web-scale discovery will do the same thing for libraries, but this is a tool that remains out of reach of many libraries.

Librarians and Archivists

Librarians and archivists have, as always, a duty to collect and preserve linked data sets. We know how valuable the earliest examples of any piece of data storage are–whether it’s a clay tablet, a book, or an index. We create bibliographies to see how knowledge changed over time or in different contexts. We need to be careful to preserve important data sets currently being produced, and maintain them over time so they remain accessible for future needs. But there’s another danger inherent in not being scrupulous about data integrity. Maintaining accurate and diverse data sets will help keep future information factual and unbiased. When a fact is one step removed from its source, it becomes even more difficult to check it for accuracy. While outright falsehood or misstatement is possible to correct, it will also be important to present alternate perspectives to ensure that scholarship can progress. (For an example of the issues in only presenting the most mainstream understanding of history, see “The ‘Undue Weight’ of Truth on Wikipedia”). If linked data doesn’t help us find out anything novel, will there have been a point in linking it?

Original image available at https://developers.facebook.com/docs/opengraph/

Librarians need to understand what the semantic web is and how to use it, but this can be challenging. While the promise of the semantic web has existed for over a decade, to the uninitiated there may not seem to be many implementations that are accessible to the average person.

One implementation that most people use daily is Facebook’s Open Graph Protocol, which is their version of the semantic web. This is a useful example to illustrate the ideas behind the semantic web and linked data. Libraries and other cultural institutions want and need to make their data open, and Facebook’s openness is highly questionable, so it will also illustrate some of the potential problems with linked data that isn’t open. There is much great work being done in the library world with the semantic web and linked data, which will be addressed in more detail in further posts.

The Semantic Web and Linked Data

The “semantic web” describes a web where data is understood by computers in some of the same ways humans understand it. Tim Berners-Lee illustrates this wonderfully in his 2001 Scientific American article with a future in which the diagnosis of a family member with cancer is made easier by the smart device which can find the most appropriate specialist in a convenient location at a convenient time, with very little work on the part of the searcher. This is only possible, however, when data is semantically meaningful. Open hours for a doctor (or a library) written on a website mean something to a human, but very little to a computer. Once those hours are structured in a way that can be made meaningful, the computer can tell you if the doctor’s office is open–and if it has access to your calendar, what you have to cancel to go there.
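To make the open hours example concrete, here is a minimal Python sketch of what "structured" hours let a program do. The hours themselves, the dictionary layout, and the function name are all invented for illustration.

```python
from datetime import datetime, time

# Hypothetical opening hours keyed by weekday (Monday is 0, as in datetime.weekday()).
# Days that are missing are treated as closed.
OPENING_HOURS = {
    0: (time(9, 0), time(17, 0)),
    1: (time(9, 0), time(17, 0)),
    2: (time(9, 0), time(17, 0)),
    3: (time(9, 0), time(17, 0)),
    4: (time(9, 0), time(12, 0)),
}


def is_open(hours, moment):
    """Return True if the given datetime falls inside that day's hours."""
    slot = hours.get(moment.weekday())
    if slot is None:
        return False
    opens, closes = slot
    return opens <= moment.time() < closes


print(is_open(OPENING_HOURS, datetime(2024, 1, 1, 10, 0)))  # a Monday morning
print(is_open(OPENING_HOURS, datetime(2024, 1, 6, 10, 0)))  # a Saturday morning
```

The same hours written as a sentence on a web page would need a person to read them; once they are data, checking them against a calendar is a one-line comparison.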

Linking data takes this implementation a step further and makes it possible to connect data, to avoid, as the W3C says “a sheer collection of datasets”. Berners-Lee outlines the steps that need to be followed to make linked data in a 2006 post, namely to use uniform resource identifiers (URIs) as names, to make those URIs HTTP URIs so that they can be looked up, to use a standard format such as RDF to present useful information, and to link to additional URIs with related information. A 2010 follow-up points out that to be linked open data, the data must be presented with a license that allows free unimpeded use, such as the Creative Commons CC-BY license. Such data doesn’t have to be structured in any particular way as long as it’s open. He says that “…you get one (big!) star if the information has been made public at all, even if it is a photo of a scan of a fax of a table — if it has an open licence.” But “five-star” linked open data meets all of the above requirements as well.
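As a rough illustration of those steps, the Python sketch below writes two statements in N-Triples, one of the plain-text RDF formats. The example.org URIs are placeholders that do not resolve to anything; the two property URIs come from the RDF Schema and Dublin Core Terms vocabularies. The helper function is hypothetical and handles only URIs and simple text values.

```python
def to_ntriples(triples):
    """Render (subject, predicate, object) tuples as N-Triples lines.

    Objects beginning with http:// or https:// are treated as URIs;
    anything else is written as a plain string literal.
    """
    lines = []
    for subject, predicate, obj in triples:
        if obj.startswith("http://") or obj.startswith("https://"):
            rendered = f"<{obj}>"
        else:
            escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
            rendered = f'"{escaped}"'
        lines.append(f"<{subject}> <{predicate}> {rendered} .")
    return lines


# Placeholder URIs under example.org, used as names for a book and a person.
TRIPLES = [
    ("http://example.org/book/pride-and-prejudice",
     "http://www.w3.org/2000/01/rdf-schema#label",
     "Pride and prejudice"),
    ("http://example.org/book/pride-and-prejudice",
     "http://purl.org/dc/terms/creator",
     "http://example.org/person/jane-austen"),
]

for line in to_ntriples(TRIPLES):
    print(line)
```

The second statement is the "link" in linked data: its object is itself a URI that could be looked up for more statements about the author.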

Facebook’s Open Graph Protocol

Moving into a different world, let’s consider what the semantic web and linked data look like at Facebook. First, it is interesting to consider what Facebook was before it was semantic. When Facebook first started in 2005, you could make a list of things you “liked”. You might have said you “liked” the movie Clueless and “liked” running, but these were just lists that would let others in your college classes know a few facts about you next time you saw them in class or at a party. In theory you could use these lists to find others that shared your interests, but this required a person to understand what interests matched each other.

But starting in 2010 these “likes” took on a real semantic meaning. Suddenly “liking” the movie Clueless meant that, among other things, the owners of the “Clueless” identity on Facebook could directly send you marketing announcements. In addition, you could “like” content outside of Facebook completely as long as that website used the correct markup on the page to speak to Facebook, and thus link together content with people. Compared with Beacon, Facebook’s earlier scheme, it was easier to understand how you were exposing yourself to advertisers and to control privacy and sharing, though this still left people troubled.

In late 2011/early 2012 Facebook opened up this system even more to third party developers, which went along with the new Facebook Timeline. Now any person could perform any verb with any application. So “Margaret read a book on Goodreads” or “Margaret listened to a song on Spotify”–real world actions–turn into semantically meaningful statements on my Facebook Timeline. As long as the user authenticates the application, the application can grab the information about the object from the webpage and show the user’s interaction with it.

Developing for the Open Graph

The Open Graph protocol was developed based on the idea of the “social graph”, which represents the connections between people and the types of relationships they have with each other. In the Facebook universe, this includes the relationships people have with other types of entities, such as media, products, and companies. It was developed by Facebook to make a quick and easy way for websites to include semantically meaningful data. It is based on the standard RDF specification for linked data and includes basic and optional metadata, as well as different types of structured data about objects, of which music and videos are the most well-defined.
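To show what the basic metadata looks like from a program's point of view, here is a small Python sketch that reads Open Graph properties out of a page using only the standard library. The sample page is invented (the example.org addresses are placeholders); og:title, og:type, og:url, and og:image are the protocol's basic properties.

```python
from html.parser import HTMLParser

# Invented sample page; the example.org addresses are placeholders.
SAMPLE_HTML = """
<html><head>
<title>Hours and Location</title>
<meta property="og:title" content="Example Library" />
<meta property="og:type" content="website" />
<meta property="og:url" content="http://example.org/library" />
<meta property="og:image" content="http://example.org/library/building.jpg" />
<meta name="description" content="Not an Open Graph property" />
</head><body></body></html>
"""


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("property") or ""
        if name.startswith("og:") and "content" in attributes:
            self.properties[name] = attributes["content"]


parser = OpenGraphParser()
parser.feed(SAMPLE_HTML)
for name, value in parser.properties.items():
    print(name, "=", value)
```

The ordinary description tag is ignored because it carries no og: property; only the marked-up values are picked up.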

To see the Open Graph in action, simply replace “www” with “graph” at the beginning of any Facebook page. For instance, let’s take a look at my own library’s information at http://graph.facebook.com/rebeccacrownlibrary. You can see that this page describes a library, and get our phone number, physical location, and open hours. Most important, a computer viewing this page can understand this information. For complete details, see the Graph API documentation–even for non-developers this is interesting; for instance, find out how to get the URL for your current profile picture to embed in other sites. To get access to this information, you can use various methods, including the Facebook Query Language.
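The "replace www with graph" step is simple enough to script. The sketch below only builds the address; it does not fetch anything, and whether a given page answers without authentication is outside what it checks. The www form of the library's address is inferred from the graph address given above, and the function name is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def graph_url(page_url):
    """Swap a leading "www." in the host name for "graph."."""
    parts = urlsplit(page_url)
    host = parts.netloc
    if host.startswith("www."):
        host = "graph." + host[len("www."):]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


print(graph_url("http://www.facebook.com/rebeccacrownlibrary"))
```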

Of course, you only get access to this information if it’s explicitly made public by the page. For anything beyond that, applications must use authentication in order to access more. Linking information from outside of Facebook is one way only–you can’t pull very much at all out of Facebook into the open web. Note that, for instance, Google searches will pull up only basic information from a Facebook page rather than any content that page has posted.

Outside of Facebook–How “Open” is the Open Graph?

It is precisely this closed effect that has a lot of people worried about Facebook’s implementation of the semantic web. In 2007 Brad Fitzpatrick described the problems inherent in implementations of the “social graph” on the web, namely that standards were quirky, non-interoperable, and usually completely walled off. The solution would be a Social Graph API that would create a social graph outside of any one company and belonging to all. This would allow people to find friends and connections without signing up for additional services or relying on Facebook or any other company. Fitzpatrick did later create a Social Graph API, which Google recently pulled out of their products. Some of the problems of an open social graph are familiar to librarians: people are hesitant to share too much information with just anyone about with whom they associate, what they like, and what they think (Prodromou). The great boon for advertisers in social networking services is that inside walled gardens with reasonable privacy controls, people are willing to share much more information. Thus the walled garden of Facebook, inaccessible to Google, means that that valuable social data is inaccessible. It is perhaps not coincidental that around the same time Google stopped supporting the open Social Graph API, they released the API for their own social networking service Google Plus.

Concerns with the Open Graph remain that it is not actually open, and in particular that it uses the open standard of RDF to ingest but not share content (Turenhout). The Open Graph Protocol website states that a variety of big websites are publishing pages with Open Graph markup, which is ingested by Facebook (of course), Google, and mixi. It remains unclear how much this particular standard will be adopted outside of Facebook.

Conclusion

Whether or not you think you have any idea what linked data is, any time you click a “like” button on a website or sign up for a social sharing app in Facebook, you are participating in the semantic web. But every time that data link goes behind a Facebook wall, it fails in being open linked data. Just as librarians have always worked to keep the world’s knowledge available to all, we must continue to ensure that potentially important linked data is kept open as well–and with no commercial motive. The LODLAM Summit has outlined and continues to work on what linked open data looks like for libraries, archives, and museums. The W3C Library Linked Data Incubator Group released its final report in fall 2011, which provides a thorough overview of the roles and responsibilities of libraries in the world of linked open data. There is a lot of possibility around this area right now, and the future openness of the world wide web may very well depend on action taken right now.

In a future post, we will examine some specific examples of work being done in the library world around the semantic web and linked data.