Saturday, February 7th, 2009

Distinct authors, phase 1 / Steve Martin is funny again

Short version. I’ve added a mechanism to “split” distinct authors with the same name. You can find it on the right of any author page, under “Author Disambiguation.” The feature is only partially rolled out, without separate pages for distinct authors or other ramifications for the LibraryThing system.

The problem is deep in the data. Libraries have a system for disambiguating authors, called Authority Control, based on coming up with authorized forms of a name and adding dates and other metadata to make them unique, and then applying these forms across the books. Authority control is a good idea—if often problematic to implement—but it falls down in the face of LibraryThing’s data. Libraries don’t coordinate their authority control as much as you’d think, and LibraryThing draws from almost 700 libraries. And even if authority control worked in libraries, 90% of LibraryThing content comes from other sources, mostly Amazon. This data has no concept of authority control. (See Steve Martin at Amazon, for example.)

In solving the problem, I decided to ignore how libraries solved the issue and concentrate on how LibraryThing could do it most easily. Authority control requires librarians to assemble data (e.g., birth and death dates) about name variants before a split is made. (Thus was born librarians’ unfortunate policy of putting out hits on individuals they could not otherwise distinguish.*) Although LibraryThing members have done an amazing job finding birth and death dates, it is still a lot of work. And a full authority-control solution would have members updating each other’s records with the “authorized” forms of the names!

I felt a better way could be found. Instead of establishing unique names and pushing them to records, members could split works arbitrarily, and the authors would come to be known by the name they share and the works that cluster under them. This is actually an old system: calling someone “the author of Ivanhoe” or “the one who wrote the Parthian history.” And, as with other features of LibraryThing cataloging, it accords with how regular people talk about authors. In a real-world situation, like a meeting of Newfoundland comedians, you wouldn’t refer to “Martin, Steve, 1945-” and “Martin, Steve, 1947-” but “Steve Martin, you know, the one who wrote Shopgirl” and “Steve Martin, the one who wrote that book about that boat.”

How it works. To split an author, find the area on the right labelled “Author Disambiguation.” It will take you to a splitting page; here’s Steve Martin’s. This page allows you to assign all the author’s works to numbers. As you assign the works, LibraryThing assigns separate colors, making it easy to see at a glance how the thing is going.
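For the curious, here is a rough sketch of the idea in Python. It is not LibraryThing’s actual code; the function names (`split_author`, `describe`) and the data shapes are made up for illustration, and the second title is a placeholder. It only shows the principle: each work gets a number, and each number becomes a distinct author identified by the shared name plus the works clustered under it.

```python
from collections import defaultdict


def split_author(shared_name, assignments):
    """Group works by the number a member assigned to each one.

    `assignments` maps a work title to a number (hypothetical shape).
    Returns one entry per number: the shared name plus that cluster's works.
    """
    clusters = defaultdict(list)
    for work, number in assignments.items():
        clusters[number].append(work)
    return {
        number: {"name": shared_name, "works": sorted(works)}
        for number, works in sorted(clusters.items())
    }


def describe(author):
    """Refer to a distinct author the way people do: by name and a work."""
    return f'{author["name"]}, the one who wrote {author["works"][0]}'


# Illustrative data only: the second title is a placeholder, not a real record.
assignments = {
    "Shopgirl": 1,
    "That book about that boat (placeholder)": 2,
}
authors = split_author("Steve Martin", assignments)
for number, author in authors.items():
    print(number, describe(author))
```

Note that nothing here needs birth or death dates; the number is arbitrary, and the works do the identifying.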

More to do. This is just a first step. The “distinct authors” feature has to “go” all sorts of places on the site. First up will be separate pages for distinct authors, and a “disambiguation page” (à la Wikipedia) tying them together. Once that’s done we can move to separating author metadata, such as Common Knowledge, between distinct authors.

Quite frankly, I’m going to do a few more things and then let this sit for a while. My main focus right now—and Chris’—is to see “collections” to the finish line. When I realized I could bang out the first phase of distinct authors in a long evening (it’s after 5am now), I went ahead and did it. But now I need to refocus on collections.

Talk about it. I’ve set up a New features post to discuss the change and its potential ramifications. I suspect that the Combiners! group will get in on the act quickly as well, working out various technical issues. They have a number of threads (here, here and here, at least), in which members have made lists of “identically named authors.” They would be a good starting point.