The symposium is really worth reading from start to finish. (Alas, one of the drawbacks of hosting a symposium on a blog is that it keeps everything in reverse chronological order; it would be great if CITP could flip the posts now that the discussion has ended.) But for those of us in the humanities the most relevant point is that we are going to have a much harder transition to an online model of scholarship than those in the sciences. The main reason for this is that for us the highest form of scholarship is the book, whereas in the sciences it is the article, which is far more easily put online, posted in various forms (including as pre- and e-prints), and networked to other articles (through, e.g., citation analysis). In addition, we’re simply not as technologically savvy. As Paul DiMaggio points out, “every computer scientist who received his or her Ph.D. in computer science after 1980 or so has a website” (on which they can post their scholarly production), whereas the number is about 40% for political scientists and I’m sure far less for historians and literature professors.

I’m planning a long post in this space on the possible ways for humanities professors to move from print to open online scholarship; this discussion is great food for thought.

[This post is a version of a message I sent to the listserv for CenterNet, the consortium of digital humanities centers. Google has expressed interest in helping CenterNet by providing a (limited) corpus of full texts from their Google Books program, but I have been arguing for an API instead. My sense is that this idea has considerable support but that there are also some questions about the utility of an API, including from within Google.]

My argument for an API over an extracted corpus of books begins with a fairly simple observation: how are we to choose a particular dataset for Google to compile for us? I’m a scholar of the Victorian era, so a large corpus from the nineteenth century would be great, but how about those who study the Enlightenment? If we choose novels, what about those (like me) who focus on scientific literature? Moreover, many of us wish to do more expansive horizontal (across genres in a particular age) and vertical (within the same genre but through large spans of time) analyses. How do we accommodate the wishes of everyone who does computational research in the humanities?

Perhaps some of the misunderstanding here is about the kinds of research a humanities scholar might do as opposed to, say, the computational linguist, who might make use of a dataset or corpus (generally a broad and/or normalized one) to assess the nature of (a) language itself, examine frequencies and patterns of words, or address computer science problems such as document classification. Some of these corpora can provide a historian like me with insights as long as the time span involved is long enough and each document includes important metadata such as publication date (e.g., you can trace the rise and fall of certain historical themes using BYU’s Time Magazine corpus).

But there are many other analyses that humanities scholars could undertake with an API, especially one that allowed them to first search for books of possible interest and then to operate on the full texts of that ad hoc corpus. An example from my own research: in my last book I argued that mathematics was “secularized” in the nineteenth century, and part of my evidence was that mathematical treatises, which normally contained religious language in the early nineteenth century, lost such language by the end of the century. By necessity, researching in the pre-Google Books era, my textual evidence was limited–I could only read a certain number of treatises and chose to focus on the writing of high-profile mathematicians.

How would I go about supporting this thesis today using Google Books? I would of course love to have an exhaustive corpus of mathematical treatises. But in my book I also used published books of poems, sermons, and letters about math. In other words, it’s hard to know exactly what to assemble in advance–just treatises would leave out much of the story and evidence.

Ideally, I would like to use an API to find books that matched a complicated set of criteria (it would be even better if I could use regular expressions to find the many variants of religious language and also to find religious language relatively close to mentions of mathematics), and then use get_cache to acquire the full OCRed text of these matching books. From that ad hoc corpus I would want to do some further computational analyses on my own server, such as extracting references to touchstones for the divine vision of mathematics (e.g., Plato’s later works, geometry rather than number theory), and perhaps even do some aggregate analyses (from which works did British mathematicians most often acquire this religious philosophy of mathematics?). I would also want to examine these patterns over time to see if indeed the bond between religion and mathematics declined in the late Victorian era.
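To make the proximity idea a bit more concrete, here is a minimal sketch in Python of what the analysis on my own server might look like once the full texts were in hand. Everything in it is an assumption for illustration: the two tiny word lists, the ten-word window, and the function names are all hypothetical, and the sample sentences are invented rather than drawn from real treatises. It does not call any Google API; it only shows the regular-expression step and the tally over time.

```python
import re
from collections import defaultdict

# Illustrative vocabularies only (assumptions); a real study would need far richer lists.
RELIGIOUS = r"(?:divin\w*|God|provident\w*|Creator|heaven\w*)"
MATHEMATICAL = r"(?:mathemati\w*|geometr\w*|algebra\w*|arithmeti\w*)"

# Allow up to MAX_GAP intervening words between the two kinds of term.
MAX_GAP = 10
GAP = r"(?:\W+\w+){0,%d}?\W+" % MAX_GAP

# Match a religious term followed closely by a mathematical one, or the reverse.
PROXIMITY = re.compile(
    r"\b%s%s%s\b|\b%s%s%s\b"
    % (RELIGIOUS, GAP, MATHEMATICAL, MATHEMATICAL, GAP, RELIGIOUS),
    re.IGNORECASE,
)


def has_proximity(text):
    """True if religious and mathematical language occur within MAX_GAP words."""
    return PROXIMITY.search(text) is not None


def share_by_decade(books):
    """books: iterable of (year, text) pairs.

    Returns {decade: fraction of that decade's books containing a match}.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, text in books:
        decade = (year // 10) * 10
        totals[decade] += 1
        if has_proximity(text):
            hits[decade] += 1
    return {decade: hits[decade] / totals[decade] for decade in sorted(totals)}


# Invented sample sentences standing in for OCRed full texts.
sample = [
    (1812, "Geometry reveals the wisdom of the divine mind."),
    (1815, "The angles of a triangle sum to two right angles."),
    (1891, "Algebra is a system of symbols and rules."),
]
print(share_by_decade(sample))
```

A falling share across the decades would be suggestive rather than conclusive, since OCR errors and the choice of word lists would both affect the counts.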

This is precisely the model I use for my Syllabus Finder. I first find possible syllabi using an algorithm-based set of searches of Google (via the unfortunately deprecated SOAP Search API) while also querying local Center for History and New Media databases for matches. Since I can then extract the full texts of matching web pages from Google (using the API’s cache function), I can do further operations, such as pulling book assignments out of the syllabi (using regular expressions).

It seems to me that a model is already in place at Google for such an API for Google Books: their special university researcher’s version of the Search API. That kind of restricted but powerful API program might be ideal because 1) I don’t think an API would be useful without the get_OCRed_text function, which (let’s face it) liberates information that is currently very hard to get even though Google has recently released a plain text view of (only some of) its books; and 2) many of us want to ping the Google Books API with more than the standard daily hit limit for Google APIs.

[Image credit: the best double-entendre cover I could find on Google Books: No Way Out by Beverly Hastings.]

I’ve spent the past two weeks trying to get a better understanding of the agreement signed by the National Archives and Footnote, about which I raised several concerns in my last post. Before making further (possibly unfounded) criticisms I thought it would be a good idea to talk to both NARA and Footnote. So I picked up the phone and found several people eager to clarify things. At NARA, Jim Hastings, director of access programs, was particularly helpful in explaining their perspective. (Alas, NARA’s public affairs staff seemed to have only the sketchiest sense of key details.) Most helpful—and most eager to rebut my earlier post—were Justin Schroepfer and Peter Drinkwater, the marketing director and product lead at Footnote. Much to their credit, Justin and Peter patiently answered most of my questions about the agreement and the operation of the Footnote website.

Surprisingly, everyone I spoke to at both NARA and Footnote emphasized that despite the seemingly set-in-stone language of the legal agreement, there is a great deal of latitude in how it is executed, and they asked me to spread the word about how historians and the general public can weigh in. It has received virtually no publicity, but NARA is currently in a public comment phase for the Footnote (a/k/a iArchives) agreement. Scroll down to the bottom of the “Comment on Draft Policy” page at NARA’s website and you’ll find a request for public comment (you should email your thoughts to Vision@nara.gov). It’s a little odd to have a request for comment after the ink is dry on an agreement or policy, and this URL probably should have been included in the press release of the Footnote agreement, but I do think after speaking with them that both NARA and Footnote are receptive to hearing responses to the agreement. Indeed, in response to this post and my prior post on the agreement, Footnote has set up a web page, “Finding the Right Balance,” to receive feedback from the general public on the issues I’ve raised. They also asked me to round up professional opinion on the deal.

I assume Footnote will explain their policies in greater depth on their blog, but we agreed that it would be helpful to record some important details of our conversations in this space. Here are the answers Justin and Peter gave to a few pointed questions.

When I first went to the Footnote site, I was unpleasantly surprised that it required registration even to look at “milestone” documents like Lincoln’s draft of the Gettysburg Address. (Unfortunately, Footnote doesn’t have a list of all of its free content yet, so it’s hard to find such documents.) Justin and Peter responded that when they launched the site there was an error in the document viewer, so they had to add authentication to all document views. A fix was rolled out on January 23, and it’s now possible to view these important documents without registering.

You do need to register, however, to print or download any document, whether it’s considered “free” or “premium.” Why? Justin and Peter candidly noted that although they have done digitization projects before, the National Archives project, which contains millions of critical—and public domain—documents, is a first for them. They are understandably worried about the “leakage” of documents from their site, and want to take it one step at a time. So to start they will track all downloads to see how much escapes, especially in large batches. I noted that downloading and even reusing these documents (even en masse) very well might be legal, despite Footnote’s terms of service, because the scans are “slavish” copies of the originals, which are not protected by copyright. Footnote lawyers are looking at copyright law and what other primary-source sites are doing, and they say that they view these initial months as a learning experience to see if the terms of service can or should change. Footnote’s stance on copyright law and terms of usage will clearly be worth watching.

Speaking of terms of usage, I voiced a similar concern about Footnote’s policies toward minors. As you’ll recall, Footnote’s terms of service say the site is intended for those 18 and older, thus seeming to turn away the many K-12 classes that could take advantage of it. Justin and Peter were most passionate on this point. They told me that Footnote would like to give free access to the site for the K-12 market, but pointed to the restrictiveness of U.S. child protection laws. Because the Footnote site allows users to upload documents as well as view them, they worry about what youngsters might find there in addition to the NARA docs. These laws also mandate the “over 18” clause because the site captures personal information. It seems to me that there’s probably a technical solution that could be found here, similar to the one PBS.org uses to provide K-12 teaching materials without capturing information from the students.

Footnote seems willing to explore such a possibility, but again, Justin and Peter chalked up problems to the newness of the agreement and their inexperience running an interactive site with primary documents such as these. Footnote’s lawyers consulted (and borrowed, in some cases) the boilerplate language from terms of service at other sites, like Ancestry.com. But again, the Footnote team emphasized that they are going to review the policies and look into flexibility under the laws. They expect to tweak their policies in the coming months.

So, now is your chance to weigh in on those potential changes. If you do send a comment to either Footnote or NARA, try to be specific in what you would like to see. For instance, at the Center for History and New Media we are exploring the possibility of mining historical texts, which will only be possible to do on these millions of NARA documents if the Archives receives not only the page images from Footnote but also the OCRed text. (The handwritten documents cannot be automatically transcribed using optical character recognition, of course, but there are many typescript documents that have been converted to machine-readable text.) NARA has not asked to receive the text for each document back from Footnote—only the metadata and a combined index of all documents. There was some discussion that NARA is not equipped to handle the flood of data that a full-text database would entail. Regardless, I believe it would be in the best interest of historical researchers to have NARA receive this database, even if they are unable to post it to the web right away.

I suppose it’s not breaking news that libraries and archives aren’t flush with cash. So it must be hard for a director of such an institution when a large corporation, or even a relatively small one, comes knocking with an offer to digitize one’s holdings in exchange for some kind of commercial rights to the contents. But as a historian worried about open access to our cultural heritage, I’m a little concerned about the new agreement between Footnote, Inc. and the United States National Archives. And I’m surprised that somehow this agreement has thus far flown under the radar of all of those who attacked the troublesome Smithsonian/Showtime agreement. Guess what? From now until 2012 it will cost you $100 a year, or even more offensively, $1.99 a page, for online access to critical historical documents such as the Papers of the Continental Congress.

This was the agreement signed by Archivist of the United States Allen Weinstein and Footnote, Inc., a Utah-based digital archives company, on January 10, 2007. For the next five years, unless you have the time and money to travel to Washington, you’ll have to fork over money to Footnote to take a peek at Civil War pension documents or the case files of the early FBI. The National Archives says this agreement is “non-exclusive”—I suppose crossing their fingers that Google will also come along and make a deal—but researchers shouldn’t hold their breaths for other options.

Footnote.com, the website that provides access to these millions of documents, charges for anything more than viewing a small thumbnail of a page or photograph. Supposedly the value-added of the site (aside from being able to see detailed views of the documents) is that it allows you to save and annotate documents in your own library, and share the results of your research (though not the original documents). Hmm, I seem to remember that there’s a tool being developed that will allow you to do all of that—for free, no less.

Moreover, you’ll also be subject to some fairly onerous terms of usage on Footnote.com, especially considering that this is our collective history and that all of these documents are out of copyright. (For a detailed description of the legal issues involved here, please see Chapter 7 of Digital History, “Owning the Past?”, especially the section covering the often bogus claims of copyright on scanned archival materials.) I’ll let the terms speak for themselves (plus one snide aside): “Professional historians and others conducting scholarly research may use the Website [gee, thanks], provided that they do so within the scope of their professional work, that they obtain written permission from us before using an image obtained from the Website for publication, and that they credit the source. You further agree that…you will not copy or distribute any part of the Website or the Service in any medium without Footnote.com’s prior written authorization.”

Couldn’t the National Archives have at least added a provision to the agreement with Footnote to allow students free access to these documents? I guess not; from the terms of usage: “The Footnote.com Website is intended for adults over the age of 18.” What next? Burly bouncers carding people who want to see the Declaration of Independence?

One seemingly minor aspect of blogs I failed to consider carefully when I programmed this site was the composition of its feed. (Frankly, I was more concerned with the merely technical question of how to write code that spits out a valid RSS or Atom feed.) Looking at a lot of blogs and their feeds, I just assumed that the standard way of doing it was to put a small part of the full post in the feed—e.g., the first 50 words or the first paragraph—and then let the reader click through to the full post on your site. I noticed that some bloggers put their entire blog in their feed, but as a new blogger—one who had just spent a lot of time redesigning his old website to accommodate a blog—I couldn’t figure out why one would want to do that since it rendered your site irrelevant. It may seem minor, but a year later I’ve realized that there is, in part, a philosophical difference between a full and partial feed. Choosing which type of feed you are going to use means making a choice about the nature of your blog—and, surprisingly, the nature of your ego too. Subscribers to this blog’s feed have probably noticed that as of my last post I’ve switched from a partial feed to a full feed, so you already know the outcome of the debate I’ve had in my head about this distinction, but let me explain my reasoning and the advantages and disadvantages of full and partial feeds.
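For the curious, the mechanical difference between the two kinds of feed is tiny. The sketch below, in Python, is not the code that runs this site; the function names and the crude tag stripping are assumptions made for illustration. It shows the "first 50 words" convention next to the full-feed alternative.

```python
import re


def partial_summary(post_html, max_words=50):
    """Return the first max_words words of a post as plain text."""
    # Crude tag stripping; good enough for a sketch, not for arbitrary HTML.
    text = re.sub(r"<[^>]+>", " ", post_html)
    words = text.split()
    if len(words) <= max_words:
        return " ".join(words)
    return " ".join(words[:max_words]) + " ..."


def feed_description(post_html, full_feed=True):
    """Choose what goes into a feed item: the whole post or a teaser."""
    return post_html if full_feed else partial_summary(post_html)
```

With full_feed set to False, readers get a teaser and must click through to the site; with it set to True, the post travels wherever the feed does.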

Putting the entire content of your blog into your feed has many practical advantages. Most obviously, it saves your readers the extra step of clicking on a link in their feed reader to view your full post. They can read your blog offline as well as online, and more easily access it on a non-computer device like a cell phone. Machine audiences can also take advantage of the full feed, searching it for keywords desired by other machines or people. For instance, most blog search engines allow you to set up feeds for posts from any blogger that contain certain words or phrases.

More important, providing a full feed conforms better with a philosophy I’ve tried to promote in this space, one of open access and the sharing of knowledge. A full feed allows for the easy redistribution of your writing and the combination of your posts with others on similar topics from other bloggers. A full feed is closer to “open source” than a feed that is tied to a particular site. For this reason, until the advent of in-feed advertising, most professional bloggers had partial feeds so readers would have to view advertising next to the full text of a post.

Even from the perspective of a non-commercial blogger—or more precisely the perspective of that blogger’s ego—full feeds can be slightly problematic. A liberated, full feed is less identifiably from you. As literary theorists know well, reading environments have a significant impact on the reception of a text. A full feed means that most of your blog’s audience will be reading it without the visual context of your site (its branding, in ad-speak), instead looking at the text in the homogenized reading environment of a feed reader. I’ve just switched from NetNewsWire to Google Reader to browse other blogs, and I especially like the way that Google’s feed reader provides a seamless stream of blog posts, one after the other, on a scrolling web page. I’m able to scan the many blogs I read quickly and easily. That reading style and context, however, makes me much less aware of specific authors. It makes the academic blogosphere seem like a stream of posts by a collective consciousness. Perhaps that’s fine from an information consumption standpoint, but it’s not so wonderful if you believe that individual voices and perspectives matter a great deal. Of course, some writers cut through the clutter and make me aware of their distinctive style and thoughts, but most don’t.

At the Center for History and New Media, we’ve been thinking a lot about the blog as a medium for academic conversation and publication—and even promotion and tenure—and the homogenized feed reader environment is a bit unsettling. Yes, it can be called academic narcissism, but maintaining authorial voice and also being able to measure the influence of individual voices is important to the future of academic blogging.

I’ve already mentioned in this space that I would like to submit this blog as part of my tenure package, for my own good, of course, but also to make a statement that blogs can and should be a part of the tenure review process and academic publication in general. But tenure committees, which generally focus on peer-reviewed writing, will need to see some proof of a blog’s use and impact. Right now the best I can do is to provide some basic stats about the readership of this blog, such as subscriptions to the feed.

But with a full feed, you can slowly lose track of your audience. Providing your entire posts in the feed allows anyone to resyndicate it, aggregate it, mash it up, or simply copy it. I must admit, I am a little leery of this possibility. To be sure, there are great uses for aggregation and resyndication. This blog is resyndicated on a site dedicated to the future of the academic cyberinfrastructure, and I’m honored that someone thought to include this modest blog among so many terrific blogs charting the frontiers of libraries, technology, and research. On the other hand, even before I started this blog I had experiences where content from my site appeared somewhere else for less virtuous reasons. I don’t have time to tell the full story here, but in 2005 an unscrupulous web developer used text from my website and a small trick called a “302 redirect” to boost the Google rankings of one of his clients. It was more amusing than infuriating—for a while a dentist in Arkansas had my bio instead of his. More seriously, millions of spam blogs scrape content from legitimate blogs, a process made much easier if you provide a full feed. And there are dozens of feed aggregators that will create a website from other people’s content without their permission. Regardless of the purpose, above board or below, I have no way of knowing about readers or subscribers to my blog when it appears in these other contexts.

But these concerns do not outweigh the spirit and practical advantages of a full feed. So enjoy the new feed—unless you’re that Arkansas dentist.

I recently polled my graduate students to see where they turn to begin research for a paper. I suppose this shouldn’t come as a surprise: the number one answer—by far—was Google. Some might say they’re lazy or misdirected, but the allure of that single box—and how well it works for most tasks—is incredibly strong. Try getting students to go to five or six different search engines for gated online databases such as ProQuest Academic and JSTOR—all of which have different search options and produce a complex array of results compared to Google. I was thinking about this recently as I tested the brand new scholarly search engine from Microsoft, Windows Live Academic. Windows Live Academic is a direct competitor to Google Scholar, which has been in business now for over a year but is still in “beta” (like most Google products). Both are trying to provide that much-desired single box for academic researchers. And while those in the sciences may eventually be happy with this new option from Microsoft (though it’s currently much rougher than Google’s beta, as you’ll see), like Google Scholar, Windows Live Academic is a big disappointment for students, teachers, and professors in the humanities. I suspect there are three main reasons for this lack of a high-quality single box humanities search.

First, a quick test of Google Scholar and Windows Live Academic. Can either one produce the source of the famous “frontier thesis,” probably the best-known thesis in American historiography?

Clearly, the usefulness of these search results is dubious, especially those from Windows Live Academic (The Political Economy of Land Conflict in the Eastern Brazilian Amazon as the top result?). Why can’t these giant companies do better than this for humanities searches?

Obviously, the people designing and building these “academic” search engines are from a distinct subset of academia: computer science and mathematical fields such as physics. So naturally they focus on their own fields first. Both Google Scholar and Windows Live Academic work fairly well if you would like to know about black holes or encryption. Moreover, “scholarship” in these fields generally means articles, not books. Google Scholar and Windows Live Academic are dominated by journal-based publications, though both sometimes show books in their search results. But when Google Scholar does so, these books seem to appear because articles that match the search terms cite these works, not because of the relevance of the text of the books themselves.

In addition, humanities articles aren’t as easy as scientific papers to subject to bibliometrics—methods such as citation analysis that reveal the most important or influential articles in a field. Science papers tend to cite many more articles (and fewer books) in a way that makes them subject to extensive recursive analysis. Thus a search on “search” on Google Scholar aptly points a researcher to Sergey Brin’s and Larry Page’s seminal paper outlining how Google would work, because hundreds of other articles on search technology dutifully refer to that paper in their opening paragraph or footnote.
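A toy example may help show what citation analysis and its "recursive" variant involve. The Python sketch below is only an illustration under stated assumptions: the four-paper citation graph is invented, the function names are hypothetical, and the scoring is a simplified PageRank-style iteration that I am substituting for whatever Google Scholar or Windows Live Academic actually do, which I do not know.

```python
from collections import Counter


def citation_counts(cites):
    """cites maps each paper to the papers it cites; returns in-citation counts."""
    counts = Counter()
    for paper, refs in cites.items():
        for ref in set(refs):
            if ref != paper:  # ignore self-citations
                counts[ref] += 1
    return counts


def recursive_scores(cites, rounds=20, damping=0.85):
    """Simplified PageRank-style scoring: a citation from a highly scored
    paper counts for more than one from an obscure paper."""
    papers = set(cites) | {r for refs in cites.values() for r in refs}
    n = len(papers)
    score = {p: 1.0 / n for p in papers}
    for _ in range(rounds):
        new = {p: (1.0 - damping) / n for p in papers}
        for p in papers:
            refs = set(cites.get(p, ())) - {p}
            # A paper that cites nothing spreads its weight evenly over all papers.
            targets = refs if refs else papers
            share = damping * score[p] / len(targets)
            for t in targets:
                new[t] += share
        score = new
    return score


# Invented citation graph: papers A, B, and D all cite C; D also cites A.
toy = {"A": ["C"], "B": ["C"], "C": [], "D": ["A", "C"]}
counts = citation_counts(toy)
scores = recursive_scores(toy)
print(counts.most_common(1))
print(max(scores, key=scores.get))
```

The method depends on dense article-to-article citation. A field whose key works are books that never appear as nodes in the graph gives such an algorithm much less to work with.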

Most important, however, is the question of open access. Outlets for scientific articles are more open and indexable by search engines than humanities journals. In addition to many major natural and social science journals, CiteSeer (sponsored by Microsoft) and ArXiv.org make hundreds of thousands of articles on computer science, physics, and mathematics freely available. This disparity in openness compared to humanities scholarship is slowly starting to change—the American Historical Review, for instance, recently made all new articles freely available online—but without a concerted effort to open more gates, finding humanities papers through a single search box will remain difficult to achieve. Microsoft claims in its FAQ for Windows Live Academic that it will get around to including better results for subjects like history, but like Google they are going to have a hard time doing that well without open historical resources.

UPDATE [18 April 2006]: Microsoft has contacted me about this post; they are interested in learning more about what humanities scholars expect from a specialized academic search engine.

UPDATE [21 April 2006]: Bill Turkel makes the great point that Google’s main search does a much better job than Google Scholar at finding the original article and author of the frontier thesis: