authorities missing in gap between id.loc.gov and viaf?

More and more authorities data is online for free in forms that are relatively easy for machines to use, which is actually pretty exciting: id.loc.gov for LCSH subject authorities, and VIAF for name authorities from the US and other national files.

But I’m not entirely sure of the scope of either of these services.

Who can explain the following mystery:

“Rote Armee Fraktion” is an LCSH “subject” (600 MARC field) heading that appears in our catalog on several items. (It’s German for “Red Army Faction”.) But it’s not in id.loc.gov. Well, not exactly: it’s there combined with a subdivision, “Rote Armee Fraktion — In art”, but normally the main heading would be there on its own too. The combined authority alone is not enough for machine processing, right?

So I figured, aha, since it’s a name, it’s probably not controlled in LCSH authority files, but in the Name Authority File at VIAF. And yet, it’s not in VIAF either.

So, who, perhaps a cataloger, can explain the mysteries of person/corporate-body name control? If this is controlled by LCSH as a subject heading, why is it not in id.loc.gov? On the other hand, if it is instead (or also) controlled by the NAF, why is it not on viaf.org? It’s clearly an official heading, it’s in LC’s catalog, right?

What is the nature of the current gap of LCSH/NAF authorities not available in either id.loc.gov or viaf.org, anyone know what’s going on?

Postscript: But the heading can be found in WorldCat Identities at that same LCCN 2008030542 (or the non-normalized form). The plot thickens! (But I don’t think WorldCat Identities reveals enough of the information I’m interested in at the moment, and I don’t think it provides a bulk download the way id.loc.gov and VIAF do, which is something I’d need for what I’m thinking of. Identities is a potentially useful service in other ways, and could be more useful if they’d provide a bulk data download.)

15 thoughts on “authorities missing in gap between id.loc.gov and viaf?”

According to this: http://id.loc.gov/authorities/about.html it looks like name authority records are not necessarily included in the id.loc.gov list. (See the section on “Which vocabularies are included”, second paragraph) The why of it not appearing in the VIAF puzzles me, since I would have thought that the entire LCNAF would appear in the VIAF the way it will definitely appear in OCLC’s authority file (those two are mirrored nightly). And the VIAF website offers no explanation.

Ralph: I haven’t extensively looked into either VIAF or Identities. I _thought_ VIAF gave me an RDF download of its entire corpus, but maybe not? Does Identities give me anything like that, a way to bulk download the entire set of data (or at least a subset of it including headings, “see from”s, and relationships)?

I want to (someday) use this info to provide “query expansion” in my solr index, using the authorities “see from” (4xx) information.
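To make that idea a bit more concrete, here is a rough sketch of turning authority records into lines for a Solr synonyms file, mapping each “see from” form to the authorized heading. Everything here is an assumption for illustration: the record dictionaries (`heading`, `see_from`) are a hypothetical shape rather than anything VIAF or id.loc.gov actually emits, the sample data is made up, and I believe (but have not verified for every Solr version) that literal commas in a synonyms file can be backslash-escaped. Multi-word synonyms also interact with the field's analysis chain, so treat this as a starting point only.

```python
def escape(heading: str) -> str:
    """Escape backslashes and commas, since commas separate synonyms in the file."""
    return heading.replace("\\", "\\\\").replace(",", "\\,")


def synonym_lines(records):
    """Build 'variant, variant => authorized heading' lines.

    `records` is a hypothetical list of dicts with keys:
      'heading'  - the authorized (1xx) form
      'see_from' - a list of variant (4xx) forms
    Records without any variants produce no line.
    """
    lines = []
    for rec in records:
        variants = [escape(v) for v in rec["see_from"]]
        if variants:
            lines.append(", ".join(variants) + " => " + escape(rec["heading"]))
    return lines


if __name__ == "__main__":
    # Made-up sample data, not real authority records.
    sample = [
        {"heading": "Rote Armee Fraktion", "see_from": ["Red Army Faction", "RAF"]},
        {"heading": "Communism", "see_from": []},
    ]
    for line in synonym_lines(sample):
        print(line)
```

The point of the sketch is that this only works with the whole file in hand, which is why a bulk download matters more to me than a lookup service.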

“the odd thing to me is that in a lookup on authorities.loc.gov, that heading is identified as a subject authority, not a name authority.”

All the authorized names also show up in the “Subject authority headings” search on authorities.loc.gov. But id.loc.gov as far as I’ve been able to determine only seems to include topical authority headings, or headings with topical subdivisions.

So, for instance, “Rote Armee Fraktion” is an authorized corporate name (defined with a 110 tag), so it shows up in both the “Subject Authority Headings” and the “Name Authority Headings” in authorities.loc.gov. It does not show up in id.loc.gov.

On the other hand, an authorized heading with a 150 tag, like “Communism” shows up only in “Subject authority Headings” and not “Name Authority Headings”. It does show up in id.loc.gov.

Now, “Rote Armee Fraktion — In Art” is also defined with a 110 tag, but because of the subdivision it does not show up in “Name Authority Headings” (which does not include headings with visible subdivisions) but only in “Subject Authority Headings”. And it shows up in id.loc.gov!

In sum, as far as I can tell, id.loc.gov should include all the headings that were in the “Subject Authority Headings” but NOT “Name Authority Headings” in authorities.loc.gov at the time of extraction. I’m not involved myself in the management of id.loc.gov, so don’t consider this answer definitive, but it seems to be consistent with my observations.
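Restated as a tiny sketch, the hypothesis above looks like this. It only encodes my observations, not any documented LC rule; the function names are made up, and the set of name tags (100, 110, 111) is my assumption about which authority tags count as “names” here.

```python
# Assumption: these MARC authority tags are treated as names.
NAME_TAGS = {"100", "110", "111"}


def in_name_list(tag: str, has_subdivision: bool) -> bool:
    """Observed: name headings appear in 'Name Authority Headings'
    only when they carry no visible subdivision."""
    return tag in NAME_TAGS and not has_subdivision


def predicted_in_id_loc_gov(tag: str, has_subdivision: bool) -> bool:
    """Hypothesis: id.loc.gov holds what is in the subject list
    (which seems to hold everything) but NOT in the name list."""
    return not in_name_list(tag, has_subdivision)


if __name__ == "__main__":
    examples = [
        ("Rote Armee Fraktion", "110", False),
        ("Communism", "150", False),
        ("Rote Armee Fraktion -- In art", "110", True),
    ]
    for heading, tag, subdivided in examples:
        print(heading, predicted_in_id_loc_gov(tag, subdivided))
```

For the three headings discussed, this predicts absent, present, and present respectively, which matches what I see on the site.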

Now, while I’ve got a lot of LCNAF in there, I only indexed the 100 fields for name searching. I’ll see about a separate corporate names index and about getting the cql.any index to do something useful.

I also noticed that the LCCN index doesn’t handle the minimally normalized forms that we’ve come to expect. I’ll update the indexing rule for that as well.
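For anyone wondering what normalizing an LCCN involves, here is a minimal sketch. I am assuming the target is the normalization LC describes for the info:lccn URI namespace (strip blanks, drop anything from the first slash on, and zero-pad the part after a hyphen to six digits); the actual indexing rule may well differ, and the function name is just for illustration.

```python
def normalize_lccn(lccn: str) -> str:
    """Normalize an LCCN following the info:lccn rules (assumed here):
    1. remove all blanks;
    2. remove a forward slash and everything to its right;
    3. remove a hyphen, left-padding the digits after it to six places.
    """
    s = "".join(lccn.split())      # step 1: strip every whitespace character
    s = s.split("/", 1)[0]         # step 2: cut at the first slash
    if "-" in s:                   # step 3: handle the hyphenated form
        prefix, serial = s.split("-", 1)
        s = prefix + serial.zfill(6)
    return s


if __name__ == "__main__":
    for raw in ["n78-890351", "n 78890351 ", "85-2", "75-425165//r75"]:
        print(repr(raw), "->", normalize_lccn(raw))
```

No validation is done here; a malformed input simply passes through the same three steps.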

I’ll make an announcement on the worldcat developers network mailing list when that happens. It should be in just a few days.

Thanks Ralph. There’s no good way to bulk download the data though, is there?

If I want to use it at _indexing_, to add synonyms to my indexed records… doing an SRU lookup per record indexed is going to be a performance problem both for my indexing, and, with several million records, for your SRU endpoint too! And yeah, it would need the corporate names to be indexed for lookup even if I did do a per-record lookup.

Personally, I don’t see any need to even put personal and corporate names in separate indexes. While they are in separate MARC fields, I can’t think of any actual use case that requires them to be distinguished. Aren’t they all just “names in the NAF”? (Or can a personal name and a corporate name be exactly the same string? Does the NAF not impose uniqueness across the boundary? Gee, I hope it does. And if it does… can’t see why you’d need separate indexes.)

But really, for my intended hypothetical use, I’d need a bulk download of some kind.

We’ve talked about an RSS mechanism in VIAF that would let the national libraries download all the VIAF records that had links to their records. But, even with RSS, that could get tedious for millions of records. What you’d really like is some sort of tarball and then RSS/OAI-PMH as a synchronization mechanism. I’ve seen no proposals for that sort of technology. If you’ve got thoughts on the matter, I’d be interested.

Ralph, how about just starting simple? Just a plain tarball, in whatever format you have the info in and want to release it in (MARCXML, RDF, whatever). No need for even an RSS/OAI-PMH synchronization method, necessarily. Just start by giving out the data, and THEN when people say “Gee, this is nice, but just the tarball without any synchronization makes it hard to do X”, THEN you’ve got an actual use case to respond to, instead of a hypothetical.

But just the data is a LOT better than nothing, and possibly better than an over-engineered synchronization method for unclear use cases.

So as far as a proposal, mine is simply: tarball, please, of whatever format(s) you already have and are willing to release.

Awesome, thanks for spearheading at least the “talking about it” phase, Ralph.

Oh, and I guess as far as “specs”, let me add to “just a tarball”, that the tarball should be at a persistent URI, and ideally the HTTP server would respond to a HEAD request with an accurate ‘last updated’ response, so I’d know if there was a new tarball since last time I downloaded. Pretty low barrier though.
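On the client side, that check could be as small as the sketch below. It assumes the server answers HEAD with a standard `Last-Modified` header; the URL in the example is a placeholder, not a real download location, and the function names are made up.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional
from urllib.request import Request, urlopen


def last_modified(url: str) -> Optional[datetime]:
    """Send a HEAD request and parse the Last-Modified header, if present."""
    req = Request(url, method="HEAD")
    with urlopen(req, timeout=30) as resp:
        header = resp.headers.get("Last-Modified")
    return parsedate_to_datetime(header) if header else None


def needs_download(remote: Optional[datetime], local: Optional[datetime]) -> bool:
    """Download when we have no local copy, when the server gives no date,
    or when the server's copy is newer than ours."""
    if local is None or remote is None:
        return True
    return remote > local


if __name__ == "__main__":
    # Placeholder URL and date: substitute the real tarball location
    # and the timestamp of your last download.
    url = "http://example.org/authorities.tar.gz"
    previous = datetime(2010, 1, 1, tzinfo=timezone.utc)
    if needs_download(last_modified(url), previous):
        print("newer tarball available, fetch it")
```

When the header is missing I err on the side of downloading again, which is wasteful but safe.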

Note that id.loc.gov DOES provide a bulk download, although not in MARC. This was a pleasantly surprising step from LC. Whether it’s in MARC or not is less important to me as a potential user/client than what information is provided; I think you already have a relatively semantically rich RDF view in VIAF, and a bulk download in that format would be quite sufficient (if I’m right about its existence. :) ).