Archive for August, 2009

Eleanor called my attention to the fact that ECCO provides a list of the most common search terms by quarter. When I looked this up, I found that the most frequent search term last quarter was “Gold,” with 5981 searches. The next most popular searches were:

Sleep (5829 searches)

America (3110 searches)

Woman (2520 searches)

Our ongoing discussion of searching methods in Burney (made possible by free access to the Burney Collection of Newspapers through October 30 via http://access.gale.com/emob) has been productive and will, I hope, continue. As we discuss Burney, I am also curious how best to approach searching ECCO. Do these search terms—“Gold,” “Sleep,” “America,” “Woman”—tell us anything about how scholars search ECCO? Are there particular methods that work?

The Association for College and Research Libraries blog has an overview of a recent conference discussion at Princeton University of “real-time libraries” that may be of interest to readers of this blog, if only for the very different set of concerns this discussion brings to online reference. The focus on the social networking made possible by the “real-time web” makes sense, given librarians’ crucial role in conveying information to students and faculty regarding online reference. Scholars interested in genuine dialogue with librarians will need to become familiar with those concerns. An overview of Stephen Francoeur’s presentation on digital reference in academic libraries can be found at http://acrlog.org/2009/08/25/the-real-time-library/.

Since Anna requested this, I’m letting people take a peek at my course-blog syllabus for my Jane Austen and the Undergraduate Novel course for the next few days; I’ll have to shut down access after that, once students begin having their discussions. I’m still working on the blog, but the syllabus and resource page should give you at least an idea of what I’m up to. I expect I’ll build some of the Burney assignments into their weekly blogging assignment.

I found this post from Rachel at A Historian’s Craft (via Carnivalesque 52) a while back, and thought it would be a useful way to discuss the Burney collection and its potential for the classroom. Frankly, since I had already spent part of the summer reading Scottish newspapers in Edinburgh, I was very interested in what Rachel had to say about the best ways to plow through such materials.

I think the best advice in Rachel’s post is to prepare a list of themes or events to use while browsing, since it’s so easy to get lost in the columns and columns of details. This would be especially important for students if you expected them to find anything relevant to a particular novel.

I also agree with Rachel that the letters and advertisements in newspapers are probably the most interesting to us as researchers, because they are the most human, least standardized elements of a very standardized medium. They provide a period flavor to readers that other parts of the paper do not, largely because they contain such a concentration of “everyday life” and its unspoken/barely spoken assumptions. I suspect that for a novel class, these would often be the most important parts.

Since I got access to the Burney, I’ve been playing around with the keyword searching, figuring out the types of assignments that would work best for my Austen and her Predecessors novel course, and this is what I’m thinking:

keyword searching in newspapers works really well for author/work information, since it is mostly contained in advertisements. I’d pair this up with the Oxford Dictionary of National Biography, to see if students could compare the publication information they find in the newspapers with what they find in the bio.

advertisements also yield good contextual clues for everyday products or practices unlikely to be fully glossed. So, for example, I found some good ads for “masquerades” and “masquerade-makers” that would be useful for readers of Fantomina. Students are probably best off getting these kinds of keywords assigned to them, at least initially. I’d pair this exercise with a period dictionary, to see if the terms coincide or diverge.

I think historical events, if they could be named with some precision, could be usefully glossed using the Burney. Unfortunately, many of the novels that we’re reading (Haywood and Davys, for example) are less interested in such “realism,” though that of course makes for another point of entry into a discussion of such issues as realism. And I’d endorse prefacing any use of the Burney with a discussion of realism and the critical debates surrounding its “rise,” including the Campbell article, I suppose.

A more general way to approach this kind of historicization, though, would be to assign students the task of finding the first advertisement of the assigned novel, then browsing the issue of the newspaper in which it occurred, to see what historical events, political debates, etc. are occurring at the moment of its first appearance. If you were doing this, you would be facing a “stump the prof” style exercise unless you were fully prepared before they undertook their researches (not a bad thing, actually). It would be interesting to compare their newspapers’ versions of that year with a typical scholarly chronology, and discuss the differences.

It would also be useful to see if you could get students to find real-world analogues to situations in the novels, but this would take some experience and direction, I think. It might also work better if teachers found such an analogue ahead of time, and used it for discussion.

Overall, the effect of the Burney searches is pointillistic: you get details, very much embedded in local contexts, without much explanation of their significance. So the kind of general question that a student might have, like, “why doesn’t Fantomina get married at the end?” will not get addressed by this kind of research activity. But it would be interesting to see how one could use this resource to investigate the multifacetedness of eighteenth-century marriages, for example. This would require a series of directed prompts, I think.

As I read over my bullet-points, I’m noticing that the best uses of Burney would entail pairing it up with other kinds of resources (ODNB, dictionaries, chronologies, etc.) so that students could follow up on what they found in Burney with additional information.

So these are some of my initial reactions. What do the rest of you think?

Anna Battigelli and Eleanor Shevlin invited me to write a bit about the Eighteenth-Century Book Tracker project that Laura Mandell linked to last week, and I’m happy to do so.

This is a project I began thinking about around a year ago, and to explain some of its premises, I’d best say a bit about the circumstances that gave rise to it. I teach at a mid-sized, primarily undergraduate public university that hasn’t purchased access to ECCO, EEBO, et al. and, realistically speaking, isn’t ever going to purchase access to them at their current prices. I’m really fortunate to be able to use ECCO and other resources at the University of Connecticut, just a few miles up the road, so my own research isn’t unduly hampered by not having them at my home institution. (What hampers my research is my 4/4 teaching load, but that’s another matter…) I can’t really take advantage of ECCO in my teaching, though, which led me to start exploring resources like Google Books and the Internet Archive. While you can’t beat the price, those sites—and, let’s recall, they’re functionally the only ones that people without institutional access to the big databases can leverage—leave a lot to be desired.

There’s been a lot of good discussion here about the nature of Google Books and the Internet Archive—what they are and aren’t good for, how best to think about them, whether as catalogues/finding aids or as searchable textbases. I hope it won’t seem too contrary of me, then, to say that, at present, they aren’t especially good at being either of those things.

Gale/Cengage has generously agreed to offer a free trial of the Burney Collection for readers of this blog at http://access.gale.com/emob. This provides us with an opportunity for an open discussion of the Burney Collection’s merits, both as a scholarly resource and as a pedagogical tool.

In preparation for the two sessions on digital text-bases, it would be interesting to hear more about how users search Burney. Search results can be overwhelming and show the need for the Library of Congress cataloguing and classification system to help categorize and make sense of the wealth of data that emerges from any given search. Thomas Mann, a Reference Librarian at the Library of Congress, has a still-useful 2005 discussion of the limits of computerized searching for research at http://www.guild2910.org/searching.htm. Mann’s site might be particularly helpful in discussing computerized searching with students. His example is that the 11,000,000 results for the word “Afghanistan” are unclassified, whereas under the LC system, they are neatly parsed into “Antiquities,” “Bibliography,” “Biography,” “Boundaries,” “Civilization,” and so forth. So the argument in favor of LC classification and cataloguing is clear.

On the other hand, it would be foolish to overlook the value of non-classified search results. Matthew’s post on machine reading makes clear the value of understanding more about what computers can do. But how best to search Burney isn’t necessarily clear at the outset. It would be very interesting to hear more about how individuals use search methods within ECCO, EEBO, and particularly Burney. We are grateful to Gale/Cengage for making this collective review possible.

The first point to mention is that the things computers are good at are very different from the things humans are good at. The worthwhile work in digital humanities (“DH” for short, a synonym for computationally assisted humanities research) keeps this fact in mind. Computers are useful for doing quickly certain basic (that is, boring) tasks that humans do slowly. They’re really good at counting, for instance. But sometimes, happily, these kinds of quantitative improvements in speed produce qualitative changes in the kinds of questions we can pose about the objects that interest us. So we literary scholars don’t want to ask computers to do our close reading for us. We want them to help us work differently by expanding what we can read (or at least interpret) and how we can read it. And we want to keep in mind that reading itself is just one (extraordinarily useful) analytical technique when it comes to understanding literary or social-aesthetic objects.

There are two main classes of literary problems that might immediately benefit from computational help. In the first, you’re looking for fresh insights into texts you already know (presumably because you’ve read them closely). In the second, you’d like to be able to say something about a large collection of texts you haven’t read (and probably can’t read, even in principle, because there are too many of them; think of the set of all novels written in English). In both cases, it would almost certainly be useful to classify or group the texts together according to various criteria, a process that is in fact at the heart of much computationally assisted literary work.

In the first case, what you’re looking for are new ways to connect or distinguish known texts. Cluster analysis is one way to do this. You take a group of texts (Shakespeare’s plays, for instance), feed them through an algorithm that assesses their similarity or difference according to a set of known features or metrics (sentence length, character or lemma n-gram frequency, part of speech frequency, keyword frequency, etc.—the specific metrics need to be worked out by a combination of so-called “domain knowledge” and trial and error), and produce a set of clusters that rank the relative similarity of each work to the others. Typical output looks something like this figure from Matthew Jockers’ blog (click the image to see it full size in its original context):

Read this diagram from the top down; the lower the branch point between two items or groups, the more closely related they are.

This may or may not be interesting. Note in particular that the cluster labels are supplied by the user, outside the computational process. In other words, the algorithm doesn’t know what the clusters mean, nor what the clustered works have in common. Still, why does Othello cluster with the comedies rather than the tragedies (or the histories, to which the tragedies are more closely related than the comedies)? The clustering process doesn’t answer that question, but I might never have thought to ask it if I hadn’t seen these results. Maybe I won’t have anything insightful to say in answer to it, but then that’s true of any other question I might ask, and at least now I have a new potential line of inquiry (which is perhaps no mean thing when it comes to Shakespeare).

(As an aside, the extent to which I’m likely to explain the categorization of Othello as a simple error instead of as something that requires further thought and attention will depend on how well I think the clustering process works overall, which in turn will depend to at least some extent on how well it reproduces my existing expectations about generic groupings in Shakespeare. The most interesting case, probably, is the one in which almost all of my expectations are met and confirmed—thereby giving me faith in the accuracy of the overall clustering—but a small number of unexpected results remain, particularly if the anomalous results square in some way with my previously undeveloped intuitions.)
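The clustering workflow sketched above can be illustrated in a few lines of code. To be clear, this is my own toy reconstruction, not the process behind Jockers’ actual figure: the miniature “plays,” the word-frequency features, and the choice of SciPy’s Ward linkage are all assumptions made for the sake of a runnable demonstration.

```python
from collections import Counter
from scipy.cluster.hierarchy import linkage, fcluster

# Toy "plays" with deliberately genre-flavored vocabularies; in practice
# these would be full texts, and the features far richer.
texts = {
    "Comedy A":  "love jest marriage jest love wit",
    "Comedy B":  "wit love jest marriage love",
    "Tragedy A": "death blood revenge death grief",
    "Tragedy B": "grief death revenge blood death",
}

# Build relative word-frequency vectors over a shared vocabulary.
vocab = sorted({w for t in texts.values() for w in t.split()})

def vectorize(text):
    counts = Counter(text.split())
    total = sum(counts.values())
    return [counts[w] / total for w in vocab]

names = list(texts)
matrix = [vectorize(texts[n]) for n in names]

# Ward linkage builds the hierarchical tree a dendrogram would draw.
tree = linkage(matrix, method="ward")

# Cutting the tree into two clusters groups the works by similarity;
# the labels (1 or 2) carry no meaning beyond membership.
labels = fcluster(tree, t=2, criterion="maxclust")
print(dict(zip(names, labels)))
```

Note that, just as in the Shakespeare example, the algorithm only groups; deciding that one cluster “is” comedy and the other tragedy remains entirely the reader’s interpretive act.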

Even more compelling to me, however, is the application of these and related techniques to bodies of text that would otherwise go simply unread and unanalyzed. If you’re working on any kind of large-scale literary-historical problems, you come up very quickly against the limits of your own reading capacity; you just can’t read most of the books written in any given period, much less over the course of centuries. And the problem only gets worse as you move forward in time, both because there’s more history to master and because authors keep churning out new material at ever-increasing rates. But if you can’t read it all, and if (as I said above) you can’t expect a computer to read it for you, what can you possibly do with all this stuff that currently, for your research purposes, may as well not exist?

Well, you can try to extract data of some kind from it, then group and sort and classify it. This might do a few different things for you:

It might allow you to test, support, or refine your large-scale claims about developments in literary and social history. If you think that allegory has changed in important and specific ways over the last three centuries, you might be able to test that hypothesis across a large portion of the period’s literary output. You’d do that by training an algorithm on a smallish set of known allegorical and non-allegorical works, then setting it loose on a large collection of novels. (This process is known as supervised classification or supervised learning, in contrast to the un- or semi-supervised clustering described briefly above. For more details, see the Jain article linked at the end of this post.) The algorithm will classify each work in the large collection according to its degree of “allegoricalness” based on the generally low-level differences gleaned from the training set. At that point, it’s up to you, the researcher, to make sense of the results. Are the fluctuations in allegorical occurrence important? How does the genre vary by date, national origin, gender, etc.? Why does it do so? In any case, what’s most exciting to me is the fact that you’re now in position to say something about these works, even if you won’t have particular insight into any one of them. Collectively, at least, you’ve retrieved them from irrelevance and opened up a new avenue for research.

The same process might also draw your attention to a particular work or set of works that you’d otherwise not have known about or thought to study. If books by a single author or those written during a few years in the early nineteenth century score off the charts in allegoricalness, it might be worth your while to read them closely and to make them the objects of more conventional literary scholarship. Again, the idea is that this is something you’d have missed completely in the absence of computational methods.

Finally, you might end up doing something like the Shakespearean clustering case above; maybe a book you do know and have always considered non-allegorical is ranked highly allegorical by the computer. Now, you’re probably right and the computer’s probably wrong about that specific book, but it might be interesting to try to figure out what it is about the book that produces the error, and to consider whether or not that fact is relevant to your interpretation of the text.
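The supervised workflow behind all three of these scenarios can be sketched briefly. Everything here is an assumption for illustration: the toy sentences, the labels, and the use of scikit-learn’s naive Bayes classifier over raw word counts; a real experiment would train on full, expert-labeled texts and far more discriminating features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny labeled training set: 1 = allegorical, 0 = not (invented examples).
train_texts = [
    "the pilgrim journeyed toward the celestial city of virtue",
    "despair and hope walked beside the wanderer as companions",
    "the squire rode to london and dined with his cousin",
    "she answered the letter and ordered tea for her guests",
]
train_labels = [1, 1, 0, 0]

# Turn texts into word-count vectors, then fit the classifier.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
clf = MultinomialNB().fit(X_train, train_labels)

# The "unread" collection: each work gets a probability of being
# allegorical, i.e. a degree of "allegoricalness."
unread = [
    "virtue and despair contended for the pilgrim soul",
    "he ordered dinner and wrote a letter to his cousin",
]
scores = clf.predict_proba(vectorizer.transform(unread))[:, 1]
for title, score in zip(unread, scores):
    print(f"{score:.2f}  {title}")
```

The scores, not the code, are where the scholarship starts: ranking the collection by them is what surfaces the off-the-charts outliers and the surprising misfits described above.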

One note of particular interest to those who care deeply about bibliography. In an earlier post about Google Book Search (a service tellingly renamed from the original Google Print), there was some debate about whether GBS is a catalog or a finding aid, and whether or not full-text search takes the place of human-supplied metadata. I think it’s obvious that both search and metadata are immensely useful and that neither can replace the other. One thing that text mining and classification might help with, though, is supplying metadata where none currently exists. Computationally derived subject headings almost certainly wouldn’t be as good as human-supplied ones, but they might be better than nothing if you have a mess of older records or very lightly curated holdings (as is true of much of the Internet Archive and GBS alike, for instance).
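As a toy illustration of that last point, here is one crude way a machine might propose subject keywords for an uncatalogued record. Plain term frequency stands in, by my assumption, for the more sophisticated mining and classification the post imagines; the sample record and stopword list are invented.

```python
import re
from collections import Counter

# A minimal stopword list; real systems use much larger ones.
STOPWORDS = {"the", "and", "of", "a", "to", "in", "was", "it"}

def candidate_subjects(text, k=3):
    """Return the k most frequent non-stopwords as rough subject headings."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(k)]

# An invented, lightly curated record with no human-supplied metadata.
record = ("The navigation of the Thames and the trade of London "
          "occupied the navigation committee; trade and navigation acts...")

print(candidate_subjects(record))
```

Headings derived this way would be noisy, but as the paragraph above suggests, noisy machine-supplied metadata may still beat no metadata at all.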

Finally, some links to useful and/or interesting material:

The MONK Project. A discussion of MONK, which is an attempt to bring corpus-oriented text analysis to the English-department mainstream, set all this in motion here on EMOB.

Hello to the Early Modern Online Bibliography blog: your discussions here are amazing, and rich with references.

Robert Markley at the University of Illinois and I started 18thConnect — we are co-directors — as a subsidiary organization to NINES (http://www.nines.org) which is incredibly supportive, both financially and in other ways as well. Basically, 18thConnect is an organization that will peer-review digital resources created by 18th-century scholars and then aggregate those resources along with commercial resources.

What does that mean? When you come to the 18thConnect home page, you will be able to search for digital resources among free scholarly resources available on the web that have been judged high quality through peer review, AND commercial catalogs: ECCO, Adam Matthew’s Eighteenth-Century Journals Portal, JSTOR, ProjectMuse, etc. Our finding aid will deliver links to these resources — 18thConnect won’t house them in any way — and then, when you click on a link to an edition of Clarissa, say, proffered by ECCO, if your library subscribes to it and you are logged in at work, you will be sent directly to the resource.

Here is the news for those of you who already know about this initiative: at our summer meeting, July 15, in Dublin, Ireland, at the Royal Irish Academy, Gale consented to give us their page images. We will attempt to machine-read them better, using our own home-made OCR program, in order to produce better plain text files, something closer to the keyed texts produced by the ECCO TCP. Gale will allow us to index the texts that we produce to allow keyword searching on ECCO texts EVEN FOR THOSE PEOPLE WHO DON’T OWN the ECCO catalog. In other words, you’ll be able to find the bibliographic data of the texts containing the keywords for which you search: if your library subscribes to ECCO, you can get the text directly, but if not, at least you now know which texts you’ll have to find through some other means (microfilm, interlibrary loan, visit to special collections).

We are now negotiating with the British Library and ESTC to get that catalog in as well. The Digital Bibliography for English Literature (formerly the NCBEL) will be in soon. We don’t yet have the 18thConnect finding aid up and running: once we have the Gale (ECCO), Adam Matthew (18th-c Journals Portal), DBEL, and ESTC data ingested and running smoothly, we will launch, we hope, in June 2010.

If you would like to contribute ideas to how this organization should work, you may wish to first take a look at online videos about NINES and 18thConnect available at: