But we’re actually getting something new worth noting this year. Today we’re seeing scholarship-quality transcriptions of tens of thousands of early English books — the EEBO Text Creation Partnership Phase I texts — become available free of charge to the general public for the first time. (As I write this, the books aren’t accessible yet, but I expect they will be once the folks in the project come back to work from the holiday.) (Update: It looks like files and links are now on GitHub; hopefully more user-friendly access points are in the works as well.)

This isn’t a new addition to the public domain; the books being transcribed have been in the public domain for some time. But it’s the first time many of them are generally available in a form that’s easily searchable and isn’t riddled with OCR errors. For the rarer works, it’s the first time they’re available freely across the world in any form. It’s important to recognize this milestone as well, because taking advantage of the public domain requires not just copyrights expiring or being waived, but also people dedicated to making the public domain available to the public.

And that is where we who work in institutions dedicated to learning, knowledge, and memory have unique opportunities and responsibilities. Libraries, galleries, archives, and museums have collected and preserved much of the cultural heritage that is now in the public domain, and that is often not findable– and generally not shareable– anywhere else. That heritage becomes much more useful and valuable when we share it freely with the whole world online than when we only give access to people who can get to our physical collections, or who can pay the fees and tolerate the usage restrictions of restricted digitized collections.

So whether or not we’re getting new works in the public domain this year, we have a lot of work to do this year, and the years to follow, in making that work available to the world. Wherever and whenever possible, those of us whose mission focuses more on knowledge than commerce should commit to having that work be as openly accessible as possible, as soon as possible.

That doesn’t mean we shouldn’t work with the commercial sector, or respect their interests as well. After all, we wouldn’t have seen nearly so many books become readable online in the early years of this century if it weren’t for companies like Google, Microsoft, and ProQuest digitizing them at much larger scale than libraries had previously done on their own. As commercial firms, they’re naturally looking to make some money by doing so. But they need us as much as we need them to digitize the materials we hold, so we have the power and duty to ensure that when we work with them, our agreements fulfill our missions to spread knowledge widely as well as their missions to earn a profit.

We’ve done better at this in some cases than in others. I’m happy that many of the libraries who partnered with Google in their book scanning program retained the rights to preserve those scans themselves and make them available to the world in HathiTrust. (Though it’d be nice if the Google-imposed restrictions on full-book downloads from there eventually expired.) I’m happy that libraries who made deals with ProQuest in the 1990s to digitize old English books that no one else was then digitizing had the foresight to secure the right to make transcriptions of those books freely available to the world today. I’m less happy that there’s no definite release date yet for some of the other books in the collection (the ones in Phase II, where the 5-year timer for public release doesn’t count down until that phase’s as-yet-unclear completion date), and that there appears to be no plan to make the page images freely available.

But when I hear about firms like Taylor &amp; Francis charging nonsubscribers as much as $48 to download a 19th-century public domain article from the Philosophical Magazine on their website, I’m going to be much more inclined to take the time to promote free alternatives scanned by others. And we can make similar bypasses of not-for-profit gatekeepers when necessary. I sympathize with Canadian institutions having to deal with drastic funding cuts, which seem to have prompted Early Canadiana Online to put many of their previously freely available digitized books behind paywalls– but I still switched my links as soon as I could to free copies of most of the same books posted at the Internet Archive. (I expect that increasing numbers of free page scans of the titles represented in Early English Books Online will show up there and elsewhere over time as well, from independent scanning projects if not from ProQuest.)

Assuming we can hold off further extensions to copyright (which, as I noted last year, is a battle we need to show up for now), four years from now we’ll finally have more publication copyrights expiring into the public domain in the US. But there’s a lot of work we in learning and memory institutions can do now in making our public domain works available to the world. For that matter, there’s a lot we can do in making the many copyrighted works we create available to the world in free and open forms. We saw a lot of progress in that respect in 2014: Scholars and funders are increasingly shifting from closed-access to open-access publication strategies. A coalition of libraries has successfully crowdfunded open-access academic monographs for less cost to them than for similar closed-access print books. And a growing number of academic authors and nonprofit publishers are making open access versions of their works, particularly older works, freely available to the world while still sustaining themselves. Today, for instance, I’ll be starting to list on The Online Books Page free copies of books that Ohio State University Press published in 2009, now that a 5-year-limited paywall has expired on those titles. And, as usual, I’m also dedicating a year’s worth of 15-year-old copyrights I control (in this case, for work I made public in 2000) to the public domain today, since the 14-year initial copyright term that the founders of the United States first established is plenty long for most of what I do.

As we celebrate Public Domain Day today, let’s look to the works that we ourselves oversee, and resolve to bring down enclosures and provide access to as much of that work as we can.

June 16, 2014

As mass digitization progresses, and as copyright terms grow longer, we now have access to much more literature from 1922 and the years before it than from the years that come soon after it. This graph by Eric Crampton, based on research by Paul Heald, is one illustration of the problem. Copyrights from 1922 and before have expired in the US (where much of the mass digitization to date has been done); copyrights from 1923 or later may or may not still be in force, and research is required to determine whether a book from after 1922 is still in copyright in the US.

As it turns out, many of them are in the public domain, particularly those published in the US between 1923 and 1963, whose copyrights had to be renewed to stay in force. Some organizations, most notably HathiTrust, have opened access to hundreds of thousands of book volumes by researching renewals and other information on books with non-obvious copyright status. In contrast, open access collections of periodicals and other serials, like those at JSTOR and the Library of Congress, generally end at 1922.
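The renewal-based rules described above can be expressed as a simple decision procedure. This is only a sketch of the US rules as summarized in this post (publication-based terms for US books, as of the time of writing); it deliberately ignores foreign works, unpublished works, notice requirements, and the many other complications that real copyright research involves:

```python
def us_copyright_status(pub_year, renewed):
    """Simplified US copyright status for a book published in the US.

    A sketch of the rules described above, not legal advice:
    - Works published through 1922 are in the public domain.
    - Works published 1923-1963 needed a renewal to stay in copyright.
    - Later works are presumed to still be in copyright.
    """
    if pub_year <= 1922:
        return "public domain"
    if pub_year <= 1963:
        if renewed:
            return "in copyright"
        return "public domain (copyright not renewed)"
    return "presumed in copyright"

# A 1940 book whose copyright was never renewed:
print(us_copyright_status(1940, renewed=False))
```

The 1923–1963 branch is where renewal research pays off: a single lookup can move a work from “presumed in copyright” to the public domain.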

Why not continue serials past 1922 as well? One reason is that copyright clearance for serials is quite a bit more complicated than for books. Like a book, an issue of a serial can be copyrighted and renewed. But an individual contribution to a serial can also be copyrighted and renewed on its own. So in order to see whether a serial volume is in the public domain, you need to check a lot more potential copyrights. [1]

However, now that all active copyright renewal records are represented online, it’s possible to check whether a particular serial issue or volume is under copyright. It hasn’t been feasible to do it quickly until now, though, because a lot of the Catalog of Copyright Entries, which has records of all copyright renewals, is only online in page-image form, and not as reliably searchable text. But over time I’ve been compiling an inventory of periodicals that have renewals of various sorts. That inventory has now reached the point where you can determine if periodicals have renewed copyrights, and when the first active renewal was made, with a few text searches.

Here’s what you search:

First, look for the name of the periodical in my First Copyright Renewals for Periodicals web page. This will tell you if there were any issue renewals made between 1950 and 1977 (for publications between 1923 and 1950). It will also tell you if there were any contribution renewals made between 1950 and mid-1953 (for publications between 1923 and 1926). I’m still working on adding periodicals whose first contribution renewal was made after mid-1953, but the resources below can be used to find those.

Second, search for the name of the periodical in Project Gutenberg’s compilation of copyright renewals. This is a large text file that includes both book renewals and periodical contribution renewals from mid-1953 to 1977. It doesn’t have periodical issue renewals, but those are covered in the other resources in this list, and the contribution renewals neatly pick up where my file above left off.

Third, search for the name of the periodical in the US Copyright Office online database. This contains renewals both for periodical issues and for periodical contributions (and any other kind of renewal as well) from 1978 onward. So it picks up where my First Copyright Renewals file leaves off for issues, and where Project Gutenberg’s file leaves off for contributions. My “How Can I Tell Whether a Copyright Was Renewed?” FAQ includes instructions on how to search this database.

If you search all three of these (using appropriate keywords and name variants– periodical names can vary over time) and don’t find any mention of the periodical, there’s a good chance there aren’t any copyright renewals you have to worry about.[2] If you search them in the order above and do find mention of the periodical, you’ll have a good idea about when in the periodical’s run you have to start worrying about copyright renewals.
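When one of these resources is available as a local text file (such as Project Gutenberg’s renewals compilation mentioned in step two), the name-variant searching can be automated with a case-insensitive scan. This is a sketch; the sample records below are hypothetical, and a real search should try several name variants, since periodical names change over time:

```python
import re

def find_renewal_mentions(lines, name_variants):
    """Return (line_number, line) pairs mentioning any variant of a
    periodical's name, case-insensitively. `lines` is any iterable of
    text lines, e.g. an open file handle for a local renewals file."""
    patterns = [re.compile(re.escape(v), re.IGNORECASE) for v in name_variants]
    hits = []
    for num, line in enumerate(lines, start=1):
        if any(p.search(line) for p in patterns):
            hits.append((num, line.rstrip()))
    return hits

# Hypothetical sample records, for illustration only:
sample = [
    "R123456. The Purple Sapphire. In Weird Tales, Feb. 1926.",
    "R234567. An essay in Modern Hospital, Mar. 1927.",
]
print(find_renewal_mentions(sample, ["weird tales"]))
```

A hit tells you where in the file to look; the surrounding record then shows whether it is an issue renewal or a contribution renewal, and for what date.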

So far, I’ve found out that, just as most periodical issues of the mid-20th century do not have renewed copyrights, most periodical contributions of the mid-20th century don’t, either. I’ve been looking primarily at the 1920s so far, and most of the renewed contributions I’ve seen are fiction. Short stories and serialized novels appeared in a variety of periodicals, including upmarket magazines like Collier’s and the Saturday Evening Post, pulps like Weird Tales and Marriage Stories, and daily newspapers like the Oakland Tribune and Cincinnati Enquirer.

There are a number of nonfiction renewals as well, but these tend to be less common, and often only involve a few people. Many of these renewals are for feature articles or poetry by well-known figures. (Edna St. Vincent Millay, for example, published poems in a number of magazines, and renewed many of their copyrights.) Renewals in scholarly and trade journals are rarer still, and even more strictly limited to particular people. The widow of S. S. Goldwater, for instance, appears to be responsible for the only three active contribution renewals I have found in the Journal of the American Medical Association; she also renewed the articles he wrote for Modern Hospital.

Some periodical copyright renewals may be outliving the periodicals themselves. One contribution renewal I found was for an article in the January 1926 issue of New Eve, a magazine of flapper fashion, fiction, and photography published in New York. I have not been able to find any copy of that issue anywhere. The Internet Archive has a scan of the May 1926 issue; WorldCat knows of two rare book collections that have the April issue. And that’s it– I haven’t turned up any copies of the January issue, or any other issue, in library catalogs or book sites. When that issue enters the public domain, hopefully in a few years, there may be no copies left to scan. Maybe some still exist, but the collections that hold them haven’t been cataloged, or made their catalogs available online. Either way, these records point to useful work that libraries can do to preserve rapidly vanishing cultural resources of the 20th century.

It’s also clear that libraries can make public a lot of the scholarly and professional resources of the early and middle 20th century. Since scholarly and trade journals have a very low renewal rate, both for issues and contributions, and nearly everything published in those journals was original content (so one doesn’t have to worry about reprints from other publications), the content of many journals is completely, or almost completely, in the public domain well past 1922, in many cases as late as the early 1960s. Moreover, in many fields journal articles as much as 50 or 100 years old are still of research or scholarly interest. If libraries open up the content of mid-20th-century journals, scholars and readers could benefit at least as much as they do now from HathiTrust opening up the content of mid-20th-century books.

I’ll be continuing to sweep through periodical contribution renewals from the 1920s onward, and updating my “first periodical contributions” pages as I do so. But even with what I have now, the three-step renewal search I describe above is a powerful way to find out the copyright status of periodicals published in the US. I hope I’ll hear soon about good things libraries and readers are doing with these resources.

[1] The situation is even worse when copyright terms are based on an author’s lifetime, as is the case in most countries outside the US, and in the US for recent publications. You might need to check the lifespans of hundreds or even thousands of identifiable authors to see if a particular serial volume is in the public domain. (Back)

[2] Fine print caveats: These searches will not find renewals in other categories that might conceivably appear in a periodical, like images, music, and drama. However, there are very few copyright renewals for published images (and those are all browsable in the Catalog of Copyright Entries), and music and drama only appear in certain kinds of periodicals. Also, searches on a particular publication might not turn up renewals for copyrighted materials that also appeared in other publications. This is mainly an issue for periodicals that commonly reprinted items, included book excerpts, or ran syndicated material. Also, it’s possible that there are typos or other mistakes in the files that I or Project Gutenberg prepared, so if you want to minimize risk, you may want to double-check against the page images of catalog registration records, or the original Copyright Office records. Periodicals not published in the US may be exempt from copyright renewal requirements, but as I’ve noted before, many periodicals of non-US origin can be considered published in the US for the purpose of copyright law. Finally, I am not a lawyer; for legal advice, consult with appropriate counsel. (Back)

January 1, 2014

New Year’s Day is upon us again, and with it, the return of Public Domain Day, which I’m happy to see has become a regular celebration in many places over the last few years. (I’ve observed it here since 2008.) In Europe, the Open Knowledge Foundation gives us a “class picture” of authors who died in 1943, and whose works are now entering the public domain there and in other “life+70 years” countries. Meanwhile, countries that still hold to the Berne Convention’s “life+50 years” copyright term, including Canada, Japan, New Zealand, and many others, get the works of authors who died in 1963. (The Open Knowledge Foundation also has highlights for those countries, where Narnia/Brave-New-World/purloined-plums crossover fanfic is now completely legal.) And Duke’s Center for the Study of the Public Domain laments that, for the 16th straight year, the US gets no more published works entering the public domain, and highlights the works that would have gone into the public domain here were it not for later copyright extensions.

It all starts to look a bit familiar after a few years, and while we may lament the delays in works entering the public domain, it may seem like there’s not much to do about it right now. After all, most of the world is getting another year’s worth of public domain again on schedule, and many commentators on the US’s frozen public domain don’t see much changing until we approach 2019, when remaining copyrights on works published in 1923 are scheduled to finally expire. By then, writers like Timothy Lee speculate, public domain supporters will be ready to fight the passage of another copyright term extension bill in Congress like the one that froze the public domain here back in 1998.

We can’t afford that sense of complacency. In fact, the fight to further extend copyright is raging now, and the most significant campaigns aren’t happening in Congress or other now-closely-watched legislative chambers. Instead, they’re happening in the more secretive world of international trade negotiations, where major intellectual property hoarders have better access than the general public, and where treaties can be used to later force extensions of the length and impact of copyright laws at the national level, in the name of “harmonization”. Here’s what we currently have to deal with:

Remaining Berne holdouts are being pushed to add 20 more years of copyright. Remember how I said that Canada, Japan, and New Zealand were all enjoying another year of “life+50 years” copyright expirations? Quite possibly not for long. All of those countries are also involved in the Trans-Pacific Partnership (TPP) negotiations, which include a strong push for more extensive copyright control. The exact terms are being kept secret, but a leaked draft of the intellectual property chapter from August 2013 shows agreement by many of the countries’ trade negotiators to mandate “life+70 years” terms across the partnership. That would mean a loss of 20 years of public domain for many TPP countries, and ultimately increased pressure on other countries to match the longer terms of major trade partners. Public pressure from citizens of those countries can prevent this from happening– indeed, a leak from December hints that some countries that had favored extensions back in August are reconsidering. So now is an excellent time to do as Gutenberg Canada suggests and let legislators and trade representatives know that you value the public domain and oppose further extensions of copyright.

Life+70 years countries still get further copyright extensions. The push to extend copyrights further doesn’t end when a country abandons the “life+50 years” standard. Indeed, just this past year the European Union saw another 20 years added on to the terms of sound recordings (which previously had a 50-year term of their own, in addition to the underlying life+70 years copyrights on the material being recorded). This extension is actually less than the 95 years that US lobbyists had pushed for, and are still pushing for in the Trans-Pacific Partnership, to match terms in the US.

(Why does the US have a 95-year term in the first place that it wants Europe to harmonize with? Because of the 20-year copyright extension that was enacted in 1998 in the name of harmonizing with Europe. As with climbers going from handhold to handhold and foothold to foothold higher up a cliff, you can always find a way to “harmonize” copyright ever upward if you’re determined to do so.)

The next major plateau for international copyright terms, life+100 years, is now in sight. The leaked TPP draft from August also includes a proposal from Mexico to add yet another 30 years onto copyright terms, to life+100 years, which that country adopted not many years ago. It doesn’t have much chance of passage in the TPP negotiations, where to my knowledge only Mexico has favored the measure. But it makes “life+70” seem reasonable in comparison, and sets a precedent for future, smaller-scale trade deals that could eventually establish longer terms. It’s worth remembering, for instance, that Europe’s “life+70” terms started out in only a couple of countries, spread to the rest of Europe in European Union trade deals, and then to the US and much of the rest of the world. Likewise, Mexico’s “life+100” proposal might be more influential in smaller-scale Latin American trade deals, and once established there, spread to the US and other countries. With 5 years to go before US copyrights are scheduled to expire again in significant numbers, there’s time for copyright maximalists to get momentum going for more international “harmonization”.

What’s in the public domain now isn’t guaranteed to stay there. That’s been the case for a while in Europe, where the public domain is only now getting back to where it was 20 years ago. (The European Union’s 1990s extension directive rolled back the public domain in many European countries, so in places like the United Kingdom, where the new terms went into effect in 1996, the public domain is only now getting to where it was in 1994.) But now in the US as well, where “what enters the public domain stays in the public domain” has been a long-standing custom, the Supreme Court has ruled that Congress can in fact remove works from the public domain in certain circumstances. The circumstances at issue in the case they ruled on? An international trade agreement– which as we’ve seen above is now the prevailing way of getting copyrights extended in the first place. Even an agreement that just establishes life+70 years as a universal requirement, but doesn’t include the usual grandfathered exception for older works, could put the public domain status of works going back as far as the 1870s into question, as we’ve seen with HathiTrust international copyright determinations.

But we can help turn the tide. It’s also possible to cooperate internationally to improve access to creative works, and not just lock them up further. We saw that start to happen this past year, for instance, with the signing of the Marrakesh Treaty on copyright exceptions and limitations, intended to ensure that those with disabilities that make it hard to read books normally can access the wealth of literature and learning available to the rest of the world. The treaty still needs to be ratified before it can go into effect, so we need to make sure ratification goes through in our various countries. It’s a hopeful first step in international cooperation increasing access instead of raising barriers to access.

Another improvement now being discussed is to require rightsholders to register ongoing interest in a work if they want to keep it under copyright past a certain point. That idea, which reintroduces the concept of “formalities”, has been floated by some prominent figures, like US Copyright Register Maria Pallante. Such formalities would alleviate the problem of “orphan works” that are no longer being exploited by their owners but are not available for free use. (And a sensible, uniform formalities system could be simpler and more straightforward than the old country-by-country formalities that Berne got rid of, or the formalities people already accept for property like motor vehicles and real estate.) Pallante’s initial proposal represents a fairly small step; for compatibility with the Berne Convention, formalities would not be required until the last 20 years of a full copyright term. But with enough public support, it could help move copyright away from a “one size fits all” approach to one that more sensibly balances the interests of various kinds of creators and readers.

We can also make our own work more freely available. For the last several years, I’ve been applying my own personal “formalities” program, in which I release into the public domain works I’ve created that I don’t need to further limit. So in keeping with the original 14-year renewable terms of US copyright law, I now declare that all work that I published in 1999, and that I have sole control of rights over, is hereby dedicated to the public domain via a CC0 grant. (They join other works from the 1900s that I’ve also dedicated to the public domain in previous years.) For 1999, this mostly consists of material I put online, including all versions of Catholic Resources on the Net, one of the first websites of its kind, which I edited from 1993 to 1999. It also includes another year’s history of The Online Books Page.

We can do more with work that’s under copyright, or that seems to be. Sometimes we let worries about copyright keep us from taking full advantage of what copyright law actually allows us to do with works. In the past couple of years, we saw court rulings supporting the rights of Google and HathiTrust to use digitized, but not publicly readable, copies of in-copyright books for indexing, search, and preservation purposes. (Both cases are currently being appealed by the Authors Guild.) HathiTrust has also researched hundreds of thousands of book copyrights, and as of a month ago they’d enabled access to nearly 200,000 volumes that were classified as in-copyright under simple precautionary guidelines, but determined to be actually in the public domain after closer examination.

In the coming year, I’d like to see if we can do similar work to open up access to historical journals and other serials as well. For instance, Duke’s survey of the lost public domain mentions that articles from 1957 issues of major science journals like Nature, Science, and JAMA are behind paywalls, but as far as I’ve been able to tell, none of those three journals renewed copyrights for their 1957 issues. Scientists are also increasingly making current work openly available through open access journals, open access repositories, and even discipline-wide initiatives like SCOAP3, which also debuts today.

There are also some potentially useful copyright exemptions for libraries in Section 108 of US copyright law that we could use to provide more access to brittle materials, materials nearing the end of their copyright term, and materials used by print-impaired users.

Supporters of the public domain who sit around and wait for the next copyright extension to get introduced into their legislatures are like generals expecting victory by fighting the last war. There’s a lot that public domain supporters can do, and need to do, now. That includes countering the ongoing extension of copyright through international trade agreements, promoting initiatives to restore a proper balance of interest between rightsholders and readers, improving access to copyrighted work where allowed, making work available that’s new to the public domain (or that we haven’t yet figured out is out of copyright), and looking for opportunities to share our own work more widely with the world.

So enjoy the New Year and the Public Domain Day holiday. And then let’s get to work.

October 4, 2013

Have you ever wanted to read a library? I know I have. More than once when I was young, and introduced to a new library, I’d contemplate, if just for a moment, whether I could read every book in it. (Eventually I’d do the math and realize I’d never finish in my lifetime. But I’m pretty sure I’m not the only person who thought about doing it.)

While human beings can’t literally read the text of every book in a library past a certain size, “reading libraries” from a distance — that is, understanding the nature of a large collection of information, and how it can answer our questions and satisfy our interests — is something that researchers do a lot, and for good reason. We do it, in different ways, when we survey the research output of an individual or institution; when we scan shelves of books for relevant sources; when we assess the likely worth of material from a given source; and when we conduct searches in catalogs, or library databases, or web search engines, for books, articles and websites that will answer our questions.

Computer-mediated aggregation and analysis of online information gives us a whole new set of methods for “reading” collections from various distances and viewpoints. Collectively, we’re still trying to determine the most useful methods, and the most effective implementations, for “distant reading” of collections of information. Lately I’ve been putting a fair bit of time into thinking and talking about some of them.

Back in August, for instance, Anne Seymour and I gave a talk at the VIVO 2013 Conference in St. Louis, titled “How to Read 200,000 Publications: VIVO and the Intelligent Evaluation of Scholarship”. VIVO is a semantic web application that we’re now using at Penn to aggregate, analyze, and share information about the research publications of thousands of Penn’s faculty researchers in health sciences. In our talk, we discuss ways that people could use the VIVO service and its data to better evaluate the work of these researchers. We emphasize the importance of looking at a body of work in a variety of ways, both from a distance, and close-up. And we note the dangers of relying on tempting shortcuts, like “impact factors”, to substitute for more relevant and considered assessments. We’ve posted the slides and approximate script for our talk online.

Another kind of library-reading occurs where library users ask “What does the library have to offer on a topic I’m interested in?” (“How should I start reading what it offers?” is an important followup question as well.) Librarians expect their catalogs and discovery systems to be able to answer these questions for their patrons. Some of us, like Lorcan Dempsey, say that libraries can provide much better answers if our systems support “full library discovery” encompassing many different types of information. Some, like Dale Askey, say that “Google won the discovery wars years ago”, and conclude that it’s not worth spending a lot of resources on library-specific systems that “are of little interest to our users”.

I see some truth in both of these positions. I believe that libraries have a lot of relevant offerings (not just books, articles, and other information sources, but also guidance, expertise, and space) that can help answer researchers’ questions. I also believe we could do a lot better at making these offerings apparent and intelligible to our users. But I agree that library discovery systems are not going to retake the lead of Google (and other heavily used Internet sites like Wikipedia) as common starting points for research. The findings in OCLC’s recent Perceptions of Libraries survey report, which includes the note on page 32 that “Not a single survey respondent began their information search on a library Web site,” are sobering confirmations of the futility of trying to out-Google Google.

That’s why I’ve lately been stressing the importance of making it easy for library users to find their library’s offerings from wherever they start searching. Search-engine-friendly catalogs and digital repositories are important parts of solving this problem. So is the ability to seamlessly route library users to resources that the library licenses, without them being stopped by a paywall. And so is referring users looking up topics in their favorite search engines and information sites to overviews of what their own libraries offer on those topics. Those overviews could be the attractive “bento box” overviews that Lorcan Dempsey discusses, or they could just be plain old library catalog searches, if that’s all a library can manage. Either way, getting a user to those summaries of a library’s offerings helps ensure they will get used when needed.

OCLC has offered local library referrals for specific bibliographic items from Worldcat.org for a while through its deep linking service, which is based on ISBNs and similar identifiers. More recently, I’ve been implementing local library linking for specific subjects, authors and works through the Forward to Libraries service. I’ve described in previous posts how this service now provides links from The Online Books Page and from Wikipedia, and can link to an ever-growing array of libraries of all kinds. Lots of other sites could be linking out to libraries using the FTL service as well. I’d like to get more digital and local libraries involved both in supporting and improving library-referral links.

To that end, I’ll be giving a couple of talks on Forward to Libraries in the next few weeks. At the DPLAfest in Boston on October 25, I’ll be co-leading a workshop on “Using Digital Tools to Extend the DPLA and Connect with Local Libraries”. Participants will not only see how they can take advantage of Forward to Libraries, but they’ll also get to see John Sarnowski describe and demonstrate tools that local libraries can use to digitize and share their local content with the world. At the Digital Library Federation Forum in Austin on November 5, participants will get to hear about FTL in “Forward to Libraries: Experiences Connecting Digital Libraries, Local Libraries, and Wikipedia”, one of a series of snapshot talks that also features many other interesting projects and ideas.

While my formal presentations at both meetings will be relatively brief, I’m very happy to talk more with interested folks about how they can provide links and overviews for local library resources. And outside those meetings, I’ll be glad to help hook up your library. At this point, all of the general research libraries in DLF should be registered with FTL, and hundreds of other libraries are registered as well. If you’d like to register your own library, or let me know of better ways for FTL to link to your library, just drop me a note via the service’s request form.


August 23, 2013

What a difference a few years can make. A few years ago, folks in the library world (myself included) were arguing about whether it was a good idea to let other people copy and build on their catalog records. Whether or not libraries could or should reuse and redistribute records from WorldCat, for example, was up in the air. Some of us were starting to take small steps towards putting catalog records under open licenses. For instance, I licensed the catalog records I created for The Online Books Page under the Creative Commons Attribution-ShareAlike license some years back. At the time, that was farther than many library projects were willing to go.

By now, though, there’s been a definite shift towards wider and more common opening up of bibliographic records. Large libraries like the German National Library and Harvard have released millions of their MARC records into the public domain. OCLC has revised its data policies to give their blessing to member libraries releasing their catalog data under ODC-BY. (They’ve also released some of their own data sets, like VIAF, under that license.) And large online library collaborations like Europeana and the Digital Public Library of America have adopted a policy of public domain status (using the CC0 declaration) for their bibliographic metadata. Both of these projects are now supplying promising platforms for projects that can aggregate, reuse, and build on this data in interesting and useful ways.

Now The Online Books Page is joining the CC0 party as well. Yesterday I put a CC0 notice on the more than 50,000 catalog records I’ve created for The Online Books Page’s curated collection over the last 20 years. (Yes, the site turns 20 this summer– it’s hard for me to believe it’s been up this long.) The “curated collection” refers to all the records that I’ve personally edited, as opposed to those automatically imported from other projects. (Those imported records, from sites like HathiTrust and Project Gutenberg, account for well over 1 million more books, and make up what I call the “extended shelves” of the site. I replace extended shelves records with personally edited “curated collection” records on request.)

While 50,000 records isn’t a huge number, compared to the number of records found in massive metadata collections at places like Google Books and the Harvard Library, I think it is still a useful metadata compilation. It covers a lot of important free online books and serials that for one reason or another are not in the large electronic book archives. I’ve also made various efforts to enhance the value of these records beyond what many of the more industrial-scale projects provide, including collating multi-volume works and serials, applying Library of Congress subject headings that support interesting modes of subject browsing, standardizing personal names (which enables linking between libraries and Wikipedia), and various other improvements. And I continue to add new records and improve existing ones in response to reader requests and corrections, and make them CC0 when I do.

More details about what data’s been placed into the public domain, and how to get our data, can be found on the site’s Copyrights and Licenses page. I hope people find it useful. And I’m thankful to all the other people and organizations that are now openly sharing their bibliographic data, and the people that are using it to make works easier to find and use online.

April 29, 2013

Libraries and bookstores have perennially faced the problem of how to organize books on their shelves. There’s a tension between making certain books easy to find for readers with one set of interests, and making them more difficult to find for other readers. For instance, some libraries and bookstores near me have a section for African American fiction. Readers particularly interested in African American authors can easily find their books in this section. But if novels by African American authors are shelved there instead of in the general fiction section, readers browsing general fiction might not find many African American authors there. Similar issues have arisen with genre fiction sections in libraries. A separate “Science Fiction” section can be a convenient service for fans of that genre. But some readers have objected that such sections push science fiction off into a corner, making it easy for “mainstream” readers to overlook the genre.

In theory, online libraries shouldn’t have as much problem organizing their books and subjects. Freed from the physical constraints of bound paper and shelves, the same book can be placed in many virtual locations, not just one. But in practice, many of the problems of categorization persist in the online world. Last week, for instance, Amanda Filipacchi noted in the New York Times that Wikipedia’s category listing of American novelists was disproportionately male, in part because some editors had been taking women authors out of this category and moving them to the more specialized “American women novelists” category. As far as I can tell, Wikipedia policy does not call for this sort of marginalization, but it doesn’t prevent it from happening either. It’s not just a matter of editors with an agenda and time on their hands; it also happens because manually filing people under multiple categories takes more effort than filing them under one, and it’s easy to neglect or forget to put someone in a broader category after placing them in a narrower one. So people are classified under women authors but not authors, under chemists but not scientists, under Catholics but not Christians. Readers who look for articles in the more general category listings can easily miss people who are only filed in the more specific ones. (And even if those category listings were not originally intended for browsing, many Wikipedia readers do use them that way.)

In systems that have explicit hierarchies of categories (such as Wikipedia categories, or Library of Congress Subject Headings), there’s a fairly straightforward way to solve this particular problem: When a person is placed in a specific category, the system should automatically also place them in any broader categories of people that encompass the original category. If someone is categorized under “Women chemists”, for instance, they should also get automatically categorized under “Chemists”, “Women scientists”, and “Scientists”. This inclusion can be implemented in various ways, but the important thing is that narrowly-classified people should be just as visible for readers browsing the broader categories as people that were explicitly classified under the broader categories.
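The automatic broadening described above amounts to computing the closure of a category’s ancestors in the hierarchy. Here is a minimal sketch in Python, using a small invented hierarchy (the category names and the `BROADER` mapping are illustrative, not any real system’s data):

```python
# Hypothetical hierarchy: each category maps to its directly broader
# categories. A real system would load this from its category graph.
BROADER = {
    "Women chemists": ["Chemists", "Women scientists"],
    "Women scientists": ["Scientists"],
    "Chemists": ["Scientists"],
}

def all_categories(explicit):
    """Expand a set of explicitly assigned categories to include every
    broader category reachable through the hierarchy."""
    result = set(explicit)
    queue = list(explicit)
    while queue:
        category = queue.pop()
        for broader in BROADER.get(category, []):
            if broader not in result:
                result.add(broader)
                queue.append(broader)
    return result

# Someone filed only under "Women chemists" also shows up under
# Chemists, Women scientists, and Scientists.
print(sorted(all_categories({"Women chemists"})))
# ['Chemists', 'Scientists', 'Women chemists', 'Women scientists']
```

Because the expansion is computed rather than stored by hand, editors only ever assign the most specific category, and the broader listings stay complete automatically.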

We could be doing this sort of thing in other library catalogs, and in Wikipedia, as well. Why aren’t we? I’ve seen a few objections to the idea:

It’s too hard to implement? It doesn’t have to be. It took me just part of a Sunday afternoon to implement the feature on The Online Books Page, and I suspect a good programmer who was familiar with (and could modify) the relevant source code would not have much trouble implementing the feature in a well-designed catalog or Wiki. In my experience, I had to spend more time modifying my data than modifying my code. The Library of Congress Subject Headings, the subject system used by The Online Books Page, is not complete or consistent in its subject hierarchies, and I had also miscoded some topical subjects as people. But it’s possible to clean up and enhance this kind of data, and doing so often benefits both present and future applications of the data.

It defeats the purpose of hierarchical categories? I’ve seen this objection made in some of the Wikipedia discussions around this issue, and it doesn’t make sense to me when I think it through. Far from being useless, the category hierarchy is precisely what makes it possible to automatically promote people in narrow categories into broader categories. It also helps save the time of categorizers; they only have to explicitly place people in precise categories, and if the hierarchy is well-constructed the system will automatically take care of the broader categories. (If the system also keeps track of which category assignments are explicit and which are automatic, it can also update them appropriately when categorizations or hierarchies get edited.) I’m also not flattening hierarchies across the board; I’m only recommending at this point that this sort of promotion be done for people, in categories of people. (More generally, it might be useful for any kind of individual instance that is categorized under abstract classes of those instances. But doing it for people is a good start.)

It makes the broader categories too crowded to be useful? In a comprehensive catalog such as Wikipedia, there will be a lot of people in categories like “writers”, once you include all the people in sub-categories. But there still will be a lot of people in that category even if you banish all the women to a “women writers” subcategory. Creating another category for “men writers” doesn’t really solve the problem; all it does is force people to choose which gender they want to browse, instead of letting them browse writers of both genders if that’s what they want to do. And after the split, the broader “writers” category will most likely still be left with a random assortment of writers without gender classification, who might or might not be the people a reader is most interested in.

Well-designed interfaces make it possible to usefully browse large collections of items. Relevance ranking, for instance, can be used to put the most notable examples of a category at the top of a long list of its members. That’s in fact what we routinely expect to happen in good search engines. And mechanisms like faceted navigation (used in many online catalogs) and subject maps (used on The Online Books Page) make it easy to shift focus to more precise or related categories based on a reader’s interests. In systems that implement these features, categories with lots of members are good things to have, not bad things.
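The faceted-navigation idea above can be sketched very simply: count how many members of a broad category fall under each value of a facet, so the interface can offer narrowing links with counts. The records and facet names below are invented for illustration:

```python
from collections import Counter

# Hypothetical records: each writer tagged with a couple of facets.
writers = [
    {"name": "A", "nationality": "American", "genre": "Science fiction"},
    {"name": "B", "nationality": "American", "genre": "Mystery"},
    {"name": "C", "nationality": "British", "genre": "Science fiction"},
]

def facet_counts(records, facet):
    """Count how many records fall under each value of a facet, so a
    browsing interface can show narrowing links like 'Mystery (1)'."""
    return Counter(record[facet] for record in records)

print(facet_counts(writers, "genre"))
# Counter({'Science fiction': 2, 'Mystery': 1})
```

With counts like these, a reader browsing the full “writers” category can narrow by genre or nationality when they choose to, instead of having the split forced on them up front.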

I haven’t yet implemented relevance ranking in my subject browsing. Right now, The Online Books Page doesn’t actually classify many people to begin with, so most of my categories don’t have a lot of people in them. But I could see a number of ways to implement such ranking in a catalog like The Online Books Page, or in Wikipedia, which I can discuss later if there’s interest.

In summary, then, well-designed catalogs and wikis should be able to categorize people comprehensively without marginalizing them. Three features that make this possible are:

detailed, well-organized systems of categories and their relationships

systems that automatically show people in broader categories when they’re classified in narrower ones

and ranking and navigation mechanisms that make it easy to pick out the people with the most general interest, or the qualities of interest to a particular researcher, from a large overall set of people.

I’ll continue to work on implementing these features on The Online Books Page, and would be very interested in participating in discussions of how they can better work there, in other catalogs, and in systems like Wikipedia.

March 22, 2013

I’m gratified by the positive response I’ve been getting to the Forward To Libraries service I first introduced last month. It really took off when I announced the templates for linking to libraries from Wikipedia a couple of weeks ago. They’ve been written up in places like Boing Boing and in Wikipedia’s own Signpost newsletter. The service now includes more than 150 libraries throughout the English-speaking world. Wikipedia editors are also adding the link templates to articles– besides the handful I added myself, more than 450 have been added by other editors at this writing. And I’ve heard from numerous librarians who now want to start editing Wikipedia themselves, both to add library links and to otherwise improve articles. (Here’s how to become a Wikipedia editor.)

So far, I’ve largely provided this service on my own, with support from the University of Pennsylvania Libraries. But I’d like to make the service more useful, and could use some help. If you’re interested, here are some things you might want to know:

Some libraries are easier to link than others. If you’re using one of many standard library catalogs or discovery systems, and you haven’t made substantial modifications to it, it’s easy for me to add your system. I basically just record what software you’re using and where on the Web the service runs, run some test searches to verify your system, and you’re good to go. If you’re using a more customized, obscure, or home-grown system, I might still be able to add links to it, but it may take me more effort to figure out how to make useful search links into the system. Any information you can provide would be helpful. There are also certain off-the-shelf systems that I have problems with. Many Polaris systems, for example, will give a “session timed out” message the first time you try to follow a search link into the system. (Back up and try the link again, and everything will be fine for some time afterwards.) Some other systems don’t seem to support deep search links in any consistent way that I’ve been able to determine– not just some very old session-based systems, but also EBSCO’s fairly new EDS discovery platform.
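In essence, registering a library means recording a search-URL pattern for its system. A minimal sketch of how such deep links might be generated follows; the library identifiers and URL templates here are entirely made up, and no vendor’s actual URL scheme is implied:

```python
from urllib.parse import quote_plus

# Hypothetical URL templates, one per registered library. The real FTL
# service records something analogous for each system it knows about.
SEARCH_TEMPLATES = {
    "examplelib": "https://catalog.example.edu/search?q={query}",
    "othertown": "https://opac.othertown.example/find?su={query}",
}

def search_link(library_id, heading):
    """Build a deep search link for a heading at a registered library."""
    template = SEARCH_TEMPLATES.get(library_id)
    if template is None:
        raise KeyError(f"library {library_id!r} is not registered")
    return template.format(query=quote_plus(heading))

print(search_link("examplelib", "Underground Railroad"))
# https://catalog.example.edu/search?q=Underground+Railroad
```

The hard part, as the post notes, isn’t generating the URL– it’s discovering a stable, session-free pattern for each vendor’s system in the first place.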

I’ve determined ways to link into these various systems from reading various documentation files I’ve found on the public Internet, along with some reverse-engineering of public web sites. If you know of better ways to link to some of these systems that I haven’t yet figured out myself, and this information can be made public, let me know.

For now, I’m declining to list libraries that don’t have many English-language subject or Library of Congress name headings, because the results of English searches in those libraries will be misleadingly incomplete. But I’m considering ways to include translated searches, where the data to support this is available, for a wider range of countries. (VIAF already provides much relevant data for names.)

The most popular new Wikipedia Library resource template is also controversial, and might be modified or deleted. I provide a number of different templates for linking from Wikipedia to libraries, including the inlined text templates “Library resources about” and “Library resources by”, and the all-in-one sidebar template “Library resources box”. By far the most used of these templates has been the Library resources box. It’s easy to spot in an article, it organizes links clearly, and it’s easy for editors to recognize as a template that they can add to articles they find of interest. But some Wikipedians, including at least one Wikipedia admin, have objected to the template. They cite style guidelines that say external link templates should not use boxes or other graphical elements, but only appear as inlined text. I’ve defended the boxes, noted how other library-related external links commonly appear in boxes, and proposed ways to address various Wikipedian concerns. But it’s ultimately up to the Wikipedia community to determine whether or how library links will appear in Wikipedia articles. To find out more about the issues, see the Library resources box talk page. And if you’re a Wikipedia editor or user, feel free to weigh in on that page or other relevant forums.

I’m exploring ways to make it easier for readers to get to our libraries. For one, I’m starting to record IP ranges for some institutions, so that local network users can follow “resources in your library” links straight to the institution’s library, without having to first register a preference. (Users can still register a different preference if they want.) IP-based routing is an experimental service, initially being provided to a limited number of institutions, and I may modify or withdraw it in the future. If you’d like me to consider it for your institution, you can submit a request, with the relevant IP ranges (preferably in CIDR format) in the “anything we should know?” field. Note that the IP ranges you submit will be published as part of the library data I’m sharing for this project.
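Matching a visitor’s address against registered CIDR ranges is straightforward with Python’s standard `ipaddress` module. The institution name and ranges below are placeholders (using documentation-reserved addresses), not real registrations:

```python
import ipaddress

# Hypothetical institutional IP ranges, in CIDR format as the post
# requests; real ranges would come from each library's submission.
INSTITUTION_RANGES = {
    "Example University": ["192.0.2.0/24", "198.51.100.0/25"],
}

def institution_for(ip_string):
    """Return the institution whose registered ranges contain this IP,
    or None if no registered range matches."""
    ip = ipaddress.ip_address(ip_string)
    for institution, ranges in INSTITUTION_RANGES.items():
        if any(ip in ipaddress.ip_network(cidr) for cidr in ranges):
            return institution
    return None

print(institution_for("192.0.2.42"))   # Example University
print(institution_for("203.0.113.9"))  # None
```

A visitor matched this way can be sent straight to their institution’s library, while still being allowed to override the guess with an explicit preference.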

I’m starting to share my work on Github. There is now a Github repository with selected data and code for the FTL project. In it, you’ll find the data I use to link to the libraries enrolled in the service, and you’ll also see the code for the main CGI script used to forward readers to those libraries. You can’t yet run the service out of the box yourself with the code and data provided so far, but I hope that what’s there will help people understand how the service works, and possibly implement similar services themselves if they’re so inclined. The data’s released under CC0, so you can reuse it however you like; and the code is open-source licensed under the Educational Community License 2.0. I hope to add more data and code over time, and I’m happy to hear suggestions for enhancements and improvements.

I’m hoping that as more people get involved, the service will improve, library resources will become more reachable online, and Wikipedia will become a more useful resource as well. If you’d like to get involved yourself, I’d love to hear what you’re up to, and what suggestions you might have.

March 4, 2013

I’ve heard the lament in more than one library discussion over the years. “People aren’t coming to our library like they should,” librarians have told me. “We’ve got a rich collection, and we’ve expended lots of resources on an online presence, but lots of our patrons just go to Google and Wikipedia without checking to see what we have.” The pattern of quick online information-finding using search engines and Wikipedia is well-known enough that it has its own acronym: GWR, for Google -> Wikipedia -> References. (David White gives a good description of that pattern in the linked article.)

Some people I’ve talked to think we should break this pattern. With the right search tool or marketing plan, some say, we can get patrons to start with us first, instead of Google or Wikipedia. This idea seems to me both futile and beside the point. Between them, Google and Wikipedia cover a vast array of online information, more than we librarians could hope to replicate or index ourselves in that medium. Also, if we truly have better resources available in our libraries than can be found on the open Web, it’s less important that our researchers start from our libraries’ websites than that they end up finding the knowledge resources our libraries make available to them.

Looked at the right way, Wikipedia can be a big help in making online readers aware of their library’s offerings. One of the things we spend a lot of time on in libraries is organizing information into distinct, conceptual categories. That’s what Wikipedia does too: so far, its English edition has over 4 million concepts identified, described, and often populated with reference links. And Wikipedia has encouraged people to add links to relevant digital library collections on various topics, through efforts like Wikipedia Loves Libraries and Wikipedian in Residence programs. But while these programs help bring some library resources online, and direct people to those selected resources, there’s still a lot of other relevant library material that users can’t get to via Wikipedia, but can via the libraries that are near them.

So how do we get people from Wikipedia articles to the related offerings of our local libraries? Essentially we need three things: First, we need ways to embed links in Wikipedia to the libraries that readers use. (We can’t reasonably add individual links from an article to each library out there, because there are too many of them– there has to be a way that each Wikipedia reader can get to their own favored libraries via the same links.) Second, we need ways to derive appropriate library concepts and local searches from the subjects of Wikipedia articles, so the links go somewhere useful. Finally, we need good summaries of the resources a reader’s library makes available on those concepts, so the links end up showing something useful. With all of these in place, it should be possible for researchers to get from a Wikipedia article on a topic straight to a guide to their local library’s offerings on that topic in a single click.

I’ve developed some tools to enable these one-click Wikipedia -> library transitions. For the first thing we need, I’ve created a set of Wikipedia templates for adding library links. The documentation for the Library resources box template, for instance, describes how to use it to create a sidebar box with links to resources about (or by) the topic of a Wikipedia article in a reader’s library, or in another library a reader might want to consult. (There’s also an option for direct links to my Online Books Page, if there are relevant books online; it may be easier in some cases for readers to access those than to access their local library’s books.)

For the links to work, we need to know about the reader’s preferred library. Users can register their preferred library (which will set a cookie in their browser recording that choice), or select it for each individual search. We know how to link to several dozen libraries so far, and can add more libraries on request. Worldcat.org, which includes holdings of thousands of libraries worldwide, is also an option. Besides the “Library resources box” template, I’ve also provided templates for in-text links to library resources, if those work better in a given article. Links to these templates can be found at the end of the “Library resources box” documentation.
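The preferred-library cookie could be handled with standard web machinery; here is a sketch using Python’s `http.cookies` module. The cookie name and library identifier are illustrative, not FTL’s actual values:

```python
from http.cookies import SimpleCookie

# Sketch of setting and reading a preferred-library cookie.
def set_preference(library_id):
    """Return a Set-Cookie header recording the user's chosen library."""
    cookie = SimpleCookie()
    cookie["preferred_library"] = library_id
    cookie["preferred_library"]["max-age"] = 60 * 60 * 24 * 365  # one year
    return cookie.output()

def read_preference(cookie_header):
    """Return the library id from a request's Cookie header, or None."""
    cookie = SimpleCookie(cookie_header)
    morsel = cookie.get("preferred_library")
    return morsel.value if morsel else None

header = set_preference("examplelib")
print(header)  # e.g. Set-Cookie: preferred_library=examplelib; Max-Age=31536000
print(read_preference("preferred_library=examplelib"))  # examplelib
```

When no cookie is present, the service falls back to asking the reader to pick a library (or, per the experiment above, guessing from their network address).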

For the second thing we need, I’ve created a library forwarding service (“Forward to Libraries”, or FTL– catchier name suggestions welcome) that transforms links from Wikipedia into searches for appropriate headings or keywords in local libraries. This is the same service I describe in my “From my library to yours” blog post from last month, but it now supports links from Wikipedia as well as to Wikipedia.

Thanks to information included in the Library of Congress’ Authorities and Vocabularies datasets, OCLC’s VIAF data feeds, Wikipedia’s database downloads, and my own metadata compiled at The Online Books Page, FTL already knows how to link directly to over 240,000 distinct authority-controlled headings known to the Library of Congress from their corresponding Wikipedia articles. (Library of Congress headings are used in most sizable US libraries, and many English-language libraries outside the US also use similar headings.)

For other articles, FTL by default will try a general keyword search based on the Wikipedia article’s title, which will often turn up useful results at the destination library. Alternatively, my templates allow Wikipedia editors to determine a specific Library of Congress heading to use in library links, if appropriate. I’m hoping to incorporate suggested headings into FTL’s own knowledge base as I detect them showing up in Wikipedia articles. I also plan to publish FTL’s data sets under open access terms, so that others can use and improve on them as well.
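The lookup-with-fallback logic described above is easy to sketch. The mapping entries below are illustrative stand-ins for FTL’s actual knowledge base of over 240,000 title-to-heading pairs:

```python
# Illustrative mapping from Wikipedia article titles to Library of
# Congress headings; FTL's real knowledge base is derived from LC,
# VIAF, and Wikipedia data.
TITLE_TO_LC = {
    "Underground Railroad": "Underground Railroad",
    "Flannery O'Connor": "O'Connor, Flannery",
}

def library_query(article_title):
    """Return (kind, query): an authority-controlled heading search when
    the LC heading is known, otherwise a plain keyword search."""
    heading = TITLE_TO_LC.get(article_title)
    if heading is not None:
        return ("heading", heading)
    return ("keyword", article_title)

print(library_query("Flannery O'Connor"))   # ('heading', "O'Connor, Flannery")
print(library_query("Some obscure topic"))  # ('keyword', 'Some obscure topic')
```

The keyword fallback keeps every article linkable, while each new heading added to the mapping quietly upgrades that article’s links to precise authority-controlled searches.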

The third part of this solution– displaying relevant resources at the destination library– can be implemented differently at each library. For most of the libraries in FTL’s current knowledge base, links go to searches in the library’s regular online catalog. But with some libraries, I’ve linked to another discovery system, if it seems to be the main search promoted at that library, and it seems to produce useful results. The Online Books Page’s subject map displays also have features that I think will be useful to Wikipedia subject researchers arriving at my site, such as also showing related subjects and books filed under those subjects. I hope in future posts to talk more about other useful guideposts and contextual information we could be providing to readers arriving from Wikipedia.

But if you’ve read this far, you probably want to see how this all works in practice. So I’ve added some example library resources boxes in a few Wikipedia articles that seemed particularly relevant this month, including those for Women’s history, Elizabeth Cady Stanton, and Flannery O’Connor. Look down in the “External links” or “Further reading” sections of those articles for the boxes, and view the page source of the articles to see how those boxes are constructed.

As with most things related to Wikipedia, this service is experimental, and subject to change (and, hopefully, improvement) over time. I’d love to hear thoughts and suggestions from users and maintainers of Wikipedia and libraries. And if you find creating these sorts of links from Wikipedia useful, and need help getting started, I’d be happy to help you bring them to your favorite Wikipedia topics and local libraries, as time permits.

February 11, 2013

Even with well over one and a half million books and serials, the collection I maintain at The Online Books Page is far from comprehensive. The gaps in coverage are not hard to notice at sites like mine, because most material published under copyright– which can be as much as 90 years old at this point– is not made freely available online. But all libraries, no matter how large or well-provisioned, have their gaps. No one can collect everything, and a persistent reader or researcher will eventually find that their questions and interests go beyond the bounds of any particular collection.

However, there are lots of libraries out there, as well as lots of online information and literature that hasn’t been collected into an institutional library. A good library, of whatever size, serves its users well by collecting the most useful materials it can get for their needs, and helping them get whatever else they need in other places. Jeff Jarvis expressed this basic idea well a few years ago when discussing news organizations: “Cover what you do best. Link to the rest.”

Many libraries already do this, in certain ways. The inter-library loan system helps library users who know they want a particular title their own library doesn’t have. Many libraries also maintain links to websites on various topics from their own library website or catalog. But these links, often maintained separately by each library, can only cover so much ground, as librarians have limited time to collect and maintain links. Even consortially maintained collections of links struggle to go beyond fairly general or niche focuses, and to stay current.

Libraries can do more, though. People coming to a library often have a particular topic in mind that they want to learn or read more about. They’re often looking for something they can pick up quickly, and for free. Knowing what that topic is, we should be able to point them towards useful literature they can quickly and freely obtain, whether or not it’s a title they already had in mind, and whether or not it’s in our own collection or something we link to directly. That’s the purpose of some new links now available on The Online Books Page.

For example, say you’re a high school student looking for books on the Underground Railroad. If you browse to this subject on the Online Books Page, you’ll find a number of free online books I list on this topic, and related topics. As before, you can explore those related topics, if you’re interested (maybe checking out fugitive slave biographies, for instance); or you can try digging deeper for books specifically on the Underground Railroad via the extended shelves.

But most of what you’ll find on my site will be 19th century and early 20th century materials. Your local library is likely to have books you can freely read as well, reflecting more up-to-date historical research, as well as books that might be more accessible to a high school student. There might also be useful research materials online that you can look at for free.

That’s why there’s a new “See also…” note just under the big “Underground Railroad” heading. If you click on the words “your library” in that note, you’ll be referred to your regular library, if we know about it, to see what they have on the Underground Railroad. (If you haven’t already told us which local library you want to use regularly, we give you a list of choices. It’s a pretty small list to start with, but I’m taking requests for more libraries to add. Or you can opt for OCLC’s Worldcat.org– it covers lots of libraries throughout North America and beyond.) Even after you register a preferred library, you’re not stuck with only using that one. You can click on the “elsewhere” link in the note to try a different library or service from the one you usually check– like maybe the university library that’s near your public library (or vice versa).

You might also want to find online research resources that aren’t books. For some of those, try clicking on the Wikipedia link provided for this subject. While the quality and reliability of Wikipedia articles themselves can vary, most mature Wikipedia entries include a rich set of useful links to more information. (I’ve discussed previously how useful Wikipedia is as a concept-oriented catalog.) The references and external links on Wikipedia’s Underground Railroad article, for instance, cover a wide range of informational websites, contemporary and current books, and digital library collections.

Similarly, if you’re looking at a list of online books by a particular author (like, say, W. E. B. Du Bois), you’ll find a link at the bottom of the page to find more books by the author in libraries, as well as links to online books or Wikipedia articles about the author near the top. There are also links to find library copies of a particular book on its detailed catalog page; see for instance, the links at the bottom of our catalog entry for The Souls of Black Folk. This can be useful for people who want a print copy, or a different edition from the ones we list.

So far, I’ve added links from The Online Books Page to Wikipedia for more than 17,000 subjects, and links to library catalogs for millions of subjects, authors, and titles. (My thanks to OCLC, the Library of Congress, and Wikipedia for providing bulk access to the data that makes it possible to do much of this automatically.) I’ll be developing this service further, and doing more things with this data, in ways that I hope to describe here shortly. But I hope this first step is a useful demonstration of ways that different kinds of libraries and catalogs– online and local, academic and public, institutional and informal– can support each other through user-directed, context-sensitive, concept-level links between collections.

January 1, 2013

The first day of the new year is Public Domain Day, when many countries celebrate a year’s worth of copyrights expiring, and the associated works becoming freely available for anyone to share and adapt. As the Public Domain Day page at Duke’s Center for the Public Domain notes, the United States once again does not have much to celebrate. Except for unpublished works by authors who died in 1942, no copyrights expire in the US today. Under current law, Americans still have to wait six more years before any more copyrights of published works will expire. (Subsisting copyrights from 1923 are scheduled to finally enter the public domain at the start of 2019.)

The start of 2013 is more significant in Europe, where the Open Knowledge Foundation has a more upbeat Public Domain Day site featuring authors who died in 1942, and whose published works enter the public domain today in most of the European Union. But that isn’t actually breaking new ground in most of Europe, because 2013 is also the 20th anniversary of the 1993 European Union Copyright Duration Directive, which required European countries to retroactively extend their copyright terms from the Berne Convention’s “life of the author plus 50 years” to “life of the author plus 70 years”, and put 20 years’ worth of public domain works back into copyright in those countries.

For countries that used the Berne Convention’s term and implemented the directive right away, today marks the day that the public domain finally returns to the extent it had 20 years ago. Only next year will Europe start seeing truly new public domain works. (And since many European countries took a couple of years or more to implement the directive– the UK implemented it at the start of 1996, for instance– it may still be a few years yet before their public domain is back again to what it once was.)

At least the last US copyright extension, in 1998, only froze the public domain, without rolling it back. If the US had not passed that extension, we would be seeing works published in 1937, such as the first edition of J.R.R. Tolkien’s The Hobbit, now entering the public domain. (If the US hadn’t made any post-publication extensions, we’d also have the more familiar revision of The Hobbit, in which Gollum does not voluntarily give Bilbo the Ring, in the public domain now as well, along with all three volumes of The Lord of the Rings.) Folks in Canada and other “life+50 years” countries, now celebrating the public domain status of works by authors who died in 1962, may be able to freely share and adapt Tolkien’s works in another 11 years. Folks in Europe and the US who’d like to see a variety of visual adaptations, though, will have to content themselves with the estate-licensed Peter Jackson and Rankin/Bass adaptations for a while to come.
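The “life plus N years” arithmetic running through the examples above can be sketched in a few lines. This is a deliberately simplified illustration– it assumes works enter the public domain on January 1 after the term expires, and ignores wartime extensions, US-specific publication-based terms, and other national quirks:

```python
def pd_entry_year(death_year: int, term_years: int) -> int:
    """Year on whose January 1 an author's works enter the public domain
    under a simple "life of the author plus term_years" rule.

    The term runs through the end of the year death_year + term_years,
    so works become free at the start of the following year."""
    return death_year + term_years + 1

# Authors who died in 1942, under the EU's life+70 term:
print(pd_entry_year(1942, 70))  # 2013 -- this year's European entrants

# Tolkien died in 1973:
print(pd_entry_year(1973, 50))  # 2024 -- life+50 countries like Canada
print(pd_entry_year(1973, 70))  # 2044 -- life+70 countries like the EU and UK
```

This also makes the directive’s rollback easy to see: raising `term_years` from 50 to 70 pushes every entry date 20 years later, pulling already-free works back into copyright.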

But there are still things Americans can do to make today meaningful. For the last few years, I’ve been releasing copyrights I control into the public domain after 14 years (the original term of copyright set by the country’s founders, with an option to renew for another 14). So today, I dedicate all such copyrights for works I published in 1998 to the public domain. This includes my computer science doctoral dissertation, Mediating Among Diverse Data Formats. If I believed a recent fearmongering statement from certain British journal editors, I should be worried about plagiarism resulting from this dedication, which doesn’t even have the legal attribution requirement of the CC-BY license they decry. But as I’ve explained in a previous post on plagiarism, plagiarism is fundamentally an ethical rather than a legal matter, and scholars can no more get away with plagiarizing public domain material than they can with copyrighted material. Both are, and should be, career-killers in academia.