
An enormous amount of academic life happens in digital spaces these days. The microblogging service Twitter, which has been around since 2007, has all but replaced the academic listserv in 2017. Despite the various ways that Twitter continues to struggle with content and user management, it has perhaps accidentally become a widely used professional space for academics to network, exchange ideas, and collaborate. When Twitter works like this, it is brilliant. When Twitter does not work like this, it becomes an incredibly dangerous place for people who are already in precarious positions (for any number of reasons: rank, social identity or identities, job market status, any intersection of the above, etc.).

It is becoming increasingly clear to me that academics in general – especially early-stage graduate students – are in desperate need of training, support, & guidance on professional social media spaces. Although more senior scholars are often on social media, they are also secure enough in their positions and academic life in general that their experiences of social media and social networking can be very different from those of their graduate students.

I’ve been on Twitter since 2010, and I have seen this play out more than a few times, including as a graduate student myself. In these seven years I have maintained what I hope is a very professional profile and I have accidentally amassed a rather large following (in the thousands). I would not go so far as to say that I am Internet Famous, but certainly it is rare now that I walk into a room and don’t know someone there. I try to be very modest about my internet life, but I also recognize that this is quite difficult when I occupy this space.

People have asked me for years about how I do this. I tweeted yesterday about how I manage my own social media presence, which unexpectedly got a lot of interest. I thought it might be good to have these up more permanently for reference, as the very nature of Twitter is ephemeral (except for when it’s not). I’ve kept these mostly in 140-character bites, because brevity is often better than verbosity.

1. Don’t tweet anything that I would not want to see associated with me/my name/my likeness in the international news.

2. Mute people you have to follow for whatever political reason but actively dislike. (it happens.) They don’t know you have them on mute.

3. Mute words that you don’t need to see. For example, I’m not a basketball fan, so I have “March Madness” on mute.
I use Tweetdeck to mute individual words; this link explains how to do that on Regular Twitter.

4. Everything you say on Twitter is public and reaches lots of people you don’t know.

5. 140 characters (or 280, depending on who you are now) is very, very flattening. Assume the recipient will have the worst interpretation.

6. Twitter is a space for networking and making friends, but also your seniors are watching you. They will write your letters of rec one day. (If you are up for tenure, whatever, they are writing those letters too.)

7. Yelling about politics does not make you a better person, but it does make you feel part of a larger culture of dissatisfaction. If that makes you feel better, good for you. It can be more performative than anything else.

8. There is an art to being quiet about some things. This one is hard and takes practice.

9. You like the thing you study, so tweet about what you are doing. Be generous about what you know.

10. Give yourself regular days away from Twitter. People will still be there when you come back. Go outside, watch a film, have a life.

In my new job as Digital Scholarship Fellow in Quantitative Text Analysis, I’m starting to work with students and faculty in the Liberal Arts on the hows and whys of counting words in lots of texts. This is a lot of fun, especially because I get to hear a lot about what excites different scholars from across different disciplines and what they think is fascinating about the data they work with (or want to work with).

One thing that is kind of strange about my job – and there are several aspects of my job that have required some adjustment – is that my background is broadly in corpus linguistics and English literature, so I don’t always think of the work I do as being explicitly “DH”. These distinctions are quite frankly boring unless you are knee-deep in the digital humanities, and even then I am not convinced it is an interesting discussion to have. Ultimately, people have lots of preconceived notions about what DH is and why it matters. I suspect that different disciplines within the Humanities writ large have different ideas about this too – certainly the major disciplines I cut across (English, linguistics, history, computer science, sociology) have very different perspectives on the value and experience of digital scholarship. And of course, doing “digital” work in the humanities is kind of redundant anyway: we don’t talk about “computer physics” or “test tube chemistry”, as Mike Witmore and others have pointed out.

Being mindful of this, I have acquired a few rules for doing digital scholarship over the years, and I find myself saying them a lot these days. They are as follows:

1. “Can you do X” is not a research question.

The answer to “can you do X” is almost always “yes”. Should you do X? That’s another story. Can you observe every time Dickens uses the word ‘poor’? Of course you can. But what does it tell you about poverty in Dickens’ novels? Without more detail, this just tells you that Dickens uses the word ‘poor’ in his books about the working class in 19th century Britain — and you almost certainly didn’t need a computer to tell you that. But should you observe every time Dickens uses the word ‘poor’? Maybe, if it means he uses this over other synonyms for the same concept, or if it tells us something about how characters construct themselves as working-class, or if it tells us how higher status characters understand lower-status individuals, or whatever else. These are all research questions which require further investigation, and tell us something new about Dickens’ writing.

2. Programming and other computational approaches are not findings.

So you have learned to execute a bunch of scripts (or write your own scripts) to identify something about your object of study! That’s great. Especially if you are in the humanities, this requires a certain kind of mind-bending that requires you to think about logic structures, understand how computers process information we provide, and in some cases overcome the deeply irregular rules which make your computer language of choice work. This is hard to do! You deserve a lot of commendation for having figured out how to do all of this, especially if your background is not in STEM. But – and this is hard to hear – having done this is not specifically a scholarly endeavour. This is a methodological approach. It is a means to an end, not a fact about the object(s) under investigation, and most importantly, it is not a finding. This is intrinsically tied to point #1: Should you use this package or program or script to do something? Maybe, but then you have to be ready to explain why this matters to someone who does not care about programming or computers, but cares very deeply about whatever the object of investigation is.

3. Get your digital work in front of people who want to use its findings.

Digitally inflected people already know you can use computers to do whatever it is you’re doing. It may be very exciting to them to learn that this particular software package or quantitative metric exists and should be used for this exact task, but unless they also care about your specific research question, there is a limited benefit for them beyond “oh, we should use that too”. However, if you tell a bunch of people in your specific field something very new that they couldn’t have seen without your work, that is very exciting! And that encourages new scholarship, exploring these new issues among the people to whom your findings matter most. You can tell all the other digital people about the work you’ve done as much as you want, but if your disciplinary base isn’t aware of it, they can’t cite it, they can’t expand on your research, and the discipline as a whole can’t move forward with this fact. Why wouldn’t you want that?

So our solution has been to isolate the texts which are printed in a non-English language, either monolingual texts (e.g. a book in Latin) or bi- or trilingual texts (e.g. a Dutch/English book with a Latin foreword). Looking at EEBO-the-books is a helpful way to identify languages in print, as there are all sorts of printed cues to suggest linguistic variation, such as different fonts or italics to set a different language off from the primary language. It also means I get a chance to look at many of these non-English texts as they were printed and transcribed initially.

Three years ago, I wrote a blog post about some Welsh language material that I found in EEBO-TCP Phase I. In the intervening time I still have not learned Welsh (though I am endlessly fascinated by it), still get lots of questions and clicks to this site related to Early Modern Welsh (hello Early Modern Welsh fans), and I have since learned quite a lot more about how texts were chosen to be included in EEBO (it involves the STC; Kichuk 2007 is an excellent read on this topic for the previously uninitiated). So while that previous post asked “What makes a text in EEBO an English text?”, this post will ask “What makes a text in EEBO a book?”

In general, I think we can agree that in order to be considered a book or booklet or pamphlet, a printed object has to have several pages. These pages can be created by folding one broadside sheet, or the object will have a collection of these folded sheets (called gatherings). It may or may not have a cover, but it would be in one of several formats (quarto, folio, duodecimo, etc.). To this end, Sarah Werner has an excellent exercise on how to fold a broadside paper into a gathering, which builds the basis for many, but probably not all, early books. Here is an example of a broadside that has clearly been folded up; it’s been unfolded for digitization.

TCPID A17754

So it has been folded in a style that suggests it could be read like a book, but it is not necessarily a book: there is no distinct sense of each individual page, and some of the verso/recto pages would be rendered unreadable unless the folds had been cut.

a comprehensive, international union catalogue listing early books, serials, newspapers and selected ephemera printed before 1801. It contains catalogue entries for items issued in Britain, Ireland, overseas territories under British colonial rule, and the United States. Also included is material printed elsewhere which contains significant text in English, Welsh, Irish or Gaelic, as well as any book falsely claiming to have been printed in Britain or its territories.

I select the British Library ESTC here because it covers several short title catalogues (Wing and Pollard & Redgrave are both included) and it’s my go-to short title catalogue database. Including “ephemera” is important, because it allows any number of objects to be considered as items of early print, even if they’re not really ‘books’ per se.

Such as this newspaper (TCPID A85603)…

Or this effigy, in Latin, printed on 1 broadside (TCPID A01919); click to see full-size

Or this proclamation, also printed on 1 broadside (TCPID A09573)

Or this sheet of paper, listing locations in Wales (Wales! Again!) (TCPID A14651); click to see full-size

Or this acrostic (TCPID A96959); click to see full-size

Interestingly, these are all listed as “one page” in the Jisc Historical books metadata, though they are perhaps more accurately “one sheet”. While there’s no definitive definition of “English” in Early English Books Online, it’s becoming increasingly clear to me that there’s no definitive definition of “book” either. And thank god for that, because EEBO is the gift that keeps giving when it comes to Early Modern printed materials.

The following are a list of resources I presented at Yale University (New Haven, CT, USA) on 4 May 2016 as part of my visit to the Yale Digital Humanities Lab. Thank you again for having me! This resource list includes work by colleagues of mine from the Visualising English Print project at the University of Strathclyde. You can read more about their work on our website, or read my summary blog post at this link.

Here are some facts about it: It ended up being 267 pages long; without the bibliography/front matter/appendices it’s 51,399 words long, which is about the length of 2 ⅓ Early Modern plays. There are 5 chapters and 3 appendices. It’s dedicated to Lisa Jardine, whose 1989 book Still Harping on Daughters: Women and Drama in the Age of Shakespeare changed my life, getting me interested in social identity in the Early Modern period in the first place. I am so grateful for Jonathan Hope and Nigel Fabb’s guidance. My examination is scheduled to be within the next two months, which is relatively soon. I’m excited to have it out in the world.

In lieu of an abstract I have topic modelled most of the content chapters for you (I had to take out tables and graphs, and some non-alphanumeric characters; I’m sure you will forgive me). For the unfamiliar, topic modelling looks for weighted lexical co-occurrence in text. You can read more about it at this link.
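For readers who want a more concrete picture of what topic modelling does, here is a minimal sketch using scikit-learn on invented toy documents (the toy texts, the topic count, and the library choice are all illustrative assumptions on my part, not how the thesis chapters were actually modelled):

```python
# Toy topic model: LDA finds groups of words that co-occur across documents
# and describes each document as a weighted mixture of those groups.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the king and the queen spoke at court",
    "the press printed sheets of type in the shop",
    "the queen addressed the court before the king",
    "compositors set type for the printing press",
]

X = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # one row per document, one column per topic

# Each row is a distribution over the two topics (rows sum to 1), so the
# court-themed and print-themed documents gravitate towards different topics.
print(doc_topics.round(2))
```

The "topics" themselves are just weighted word lists (available in `lda.components_`); interpreting them is the scholar's job, not the computer's.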

A note on abbreviations: The SSC (“Standardized Spelling WordHoard Early Modern Drama corpus, 1514-1662”, Martin Mueller, ed. 2010) is a subcorpus of Early Modern drama from EEBO-TCP Phase I which I use to set up comparisons between Shakespeare and his dramatist contemporaries. It generally serves as a prototype for Mueller’s later Shakespeare His Contemporaries corpus (2015). EEBO-TCP is Early English Books Online Text Creation Partnership Phase I, and HTOED is the Historical Thesaurus of the Oxford English Dictionary. WordHoard is a complete set of Shakespeare’s plays, deeply tagged for quantitative analysis.

A few years ago I wrote about non-English language printing in EEBO, a post which still gets a fair amount of traffic and a lot of people asking me about Welsh. So when I found a bilingual Latin/Hebrew book in EEBO on Friday night while searching for something else just as I was getting ready to go meet some friends for dinner, I was overjoyed. This is a book printed in Cambridge, England, in 1683, and it contains two languages which are very much not English.

JISC’s EEBO portal lists the title as “Komets leshon ha-koresh ve-ha-limudim = Manipulus linguae sanctae & eruditorum : in quo, quasi, manipulatim, congregantur sequentia, I. index generalis difficilorum vocum Hebraeo-Biblicarum, irregularium, & defectivarum, ad suas proprias radices, & radicum conjugationes, tempora, & personas, &c. reductarum” (R1614 Wing), describing it briefly as a Hebrew grammar (with the first four words in the title transliterated from Hebrew). My years of Hebrew school did not leave me a fluent Hebrew speaker or reader; I have no formal Latin or bibliographic training, but this book is really cool. Here are some reasons why…

This isn’t the title page, but it is the introductory material and you can see that it contains Hebrew, Latinate and Greek characters on the same page:

For starters, this is a bilingual grammar and index to the Old Testament, serving in some ways as a precursor to my digital concordances. But it also is fascinating because it involves several different typefaces representing several different languages, so someone in 1683 had either created a typeface for Hebrew or had access to a Hebrew typeface to print this book. Furthermore, Hebrew has a script form and a block letter form; the block letters are often used in printing whereas script is much more common elsewhere. Torahs are hand-copied onto vellum (even today!), so it is plausible someone may have had to transform each scripted character into block letters for this.

Hebrew is read from right to left, whereas Latin is read from left to right, so this book had to be very carefully typeset to put these two languages back to back. Hebrew also has a vowel system which is optional in print; the vowel marks are usually found under the consonants. Torahs often do not use the vowel system, so the inclusion of the vowels here (look for the lines, dots and small T’s) is interesting and an extra complication for typesetting.

The catchwords at the bottom of the page are printed in Hebrew here, but the book uses Latinate numbering. And – as my mother pointed out – entries are listed alphabetically in Hebrew (not in Latin).

It also includes a list of ambiguities, still written in both languages, and still juxtaposed with a left-to-right and right-to-left language.

So this is already interesting from a printing perspective, but then there are also grammatical notes and commentaries included, with descriptions of how to use this grammar. And still the juxtaposition of both languages on the same line is really fascinating:

From the grammatical guide, here is a table of conjugations in Hebrew, marked with Latin descriptions (active, passive, future, participles, etc):

And finally it ends in a two-column translation of Hebrew text into Latin:

The RSA Executive Committee regrets to announce that ProQuest has canceled our subscription to the Early English Books Online database (EEBO). The basis for the cancellation is that our members make such heavy use of the subscription, this is reducing ProQuest’s potential revenue from library-based subscriptions. We are the only scholarly society that has a subscription to EEBO, and ProQuest is not willing to add more society-based subscriptions or to continue the RSA subscription. We hoped that our special arrangement, which lasted two years, would open the door to making more such arrangements possible, to serve the needs of students and scholars. But ProQuest has decided for the moment not to include any learned societies as subscribers. Our subscription will end a few days from now, on October 31. We realize this is very late notice, but the RSA staff have been engaged in discussions with ProQuest for some weeks, in the hope of negotiating a renewal. If they change their mind, we will be the first to re-subscribe.

This is truly terrible news, especially for anyone whose institution did not/could not subscribe to the ProQuest interface.

“We’re sorry for the confusion RSA members have experienced about their ability to access Early English Books Online (EEBO) through RSA. Rest assured that access to EEBO via RSA remains in place. We value the important role scholarly societies play in furthering scholarship and will continue to work with RSA — and others — to ensure access to ProQuest content for members and institutions.”

The RSA subscription to EEBO will not be canceled on October 31, and we look forward to a continued partnership with ProQuest.

Perhaps because the first set of TCP editions of the EEBO texts are now part of the public domain, this is supposed to be sufficient for scholars’ use. Of course, this is not true: the TCP texts are a facsimile of the EEBO images (themselves facsimiles of facsimiles). However inadequate the TCP texts are for someone without an EEBO subscription, I have been collecting links for a number of years about how to access and use EEBO(-TCP). Although the cancellation was overturned, the benefit of having all these resources listed together seems to justify their continued existence here. They are also available on my links page, but in the interest of accessibility, here they are replicated:

On 1 January 2015, 25,000 hand-keyed Early Modern texts entered the public domain and were publicly posted on the EEBO-TCP project’s GitHub page, with an additional 28,000 or so forthcoming into the public domain in 2020. This project is, to say the least, a massive undertaking, and it marks a sea change in scholarly study of the Early Modern period. Moreover, we had nearly worked out how to cite the EEBO texts (the images of the books themselves) just before this happened: Sam Kaislaniemi has an excellent blogpost on how one should cite books in the EEBO interface (May 2014), but his main point is replicated here:

When it comes to digitized sources, many if not most of us probably instinctively cite the original source, rather than the digitized version. This makes sense – the digital version is a surrogate of the original work that we are really consulting. However, digitizing requires a lot of effort and investment, so when you view a book on EEBO but only cite the original work, you are not giving credit where it is due. After all, consider how easy it now is to access thousands of books held in distant repositories, simply by navigating to a website (although only if your institution has paid for access). This kind of facilitation of research should not be taken for granted.

In other words, when you use digitized sources, you should cite them as digitized sources. I do see lots of discussions about how to best access and distribute (linked) open data, but these discussions tend to avoid the question of citation. In my perfect dream world every digital repository would include a suggested citation in their README files and on their website, but alas we do not live in my perfect dream world.

For reasons which seem to be related to the increasingly widespread use of the CC-BY licences, which allow individuals to use, reuse, and “remix” various collections of texts, citation can be a complicated aspect of digital collections, although it doesn’t have to be. For example, this site has a creative commons license, but we have collectively agreed that blog posts etc are due citation; the MLA and APA offer guidelines on how to cite blog posts (and tweets, for that matter). If you use Zotero, for example, you can easily scrape the necessary metadata for citing this blog post in up to 7,819 styles (at the time of writing). This is great, except when you want to give credit where credit is due for digitized text collections, which are less easy to pull into Zotero or other citation managers. And without including this information somewhere in the corpus or documentation, it’s increasingly difficult to properly cite the various digitized sources we often use. As Sam says so eloquently, it is our duty as scholars to do so.

Corpus repositories such as CoRD include documentation such as compiler, collaborators, associated institutions, wordcounts, text counts, and often include a recommended citation, which I would strongly encourage as a best practice to be widely adopted.

Here is a working list of best citation practices outlined for several corpora I am using or have encountered. These have been cobbled together from normative citation practices with input from the collection creators. (Nb. collection creators: please contact me with suggestions to improve these citations).

This is a work in progress, and I will be updating it occasionally where appropriate. Citations below follow MLA style, but should be adaptable into the citation model of choice.

EEBO-TCP access points:
There are several access points to the EEBO-TCP texts, and one problem is that the text IDs included don’t always correspond to the same texts in all EEBO viewers, as Paul Schaffner describes below.

Benjamin Armintor has been exploring the implications of this on his blog, but in general if you’re using the full-text TCP files, you should be citing which TCP database you are using to access the full-text files. Where appropriate, I’ve included a sample citation as well.

Webster, John. The tragedy of the Dutchesse of Malfy As it was presented priuatly, at the Black-Friers; and publiquely at the Globe, by the Kings Maiesties Seruants. The perfect and exact coppy, with diuerse things printed, that the length of the play would not beare in the presentment. London: 1623, Early English Books Online Text Creation Partnership, Phase 1, Oxford, Oxfordshire and Ann Arbor, Michigan, 2015. Available online: http://name.umdl.umich.edu/A14872.0001.001, accessed 5 August 2015.

If you are citing bits of the TCP texts as part of the whole corpus of EEBO-TCP, it makes the most sense to parenthetically cite the TCP ID as its identifying characteristic (following corpus linguistic models). So for example, citing a passage from Dutchess of Malfi above would include a parenthetical including the unique TCPID (A14872).

(Presumably other Text Creation Partnership collections, such as ECCO and EVANS, should be cited in the same manner.)

In my previous post I addressed how to produce a view of many concordance plots at once, and presented concordance plots for twelve vocatives which are indicative of social class in Shakespeare and a larger reference corpus of Early Modern Drama.

After double-checking all the concordance plot files using a hand-numbered master sheet, I normalised the files using the command convert plot*.jpg -size 415x47! plot*.jpg (on the off chance that any files weren’t ultimately the same size), created a new folder of the normalised files, and pulled out the examples which matched the numbers I had for Shakespeare’s plays for further analysis. I hadn’t addressed titles, as I wasn’t really aiming to look at individual authors, so each file is named plot1, plot2, plot234, etc. I went on to compile the results for these plays, felt confident about the fact that I had isolated Shakespeare, and wrote up my previous blog post.

This morning I had a nagging thought: What if those weren’t Shakespeare’s plays? After all, I had broken my #1 rule about using computational methods – assuming that everything at every step of the process worked the way I thought it did. I am probably a self-parodying pedant when it comes to computational methods, because when something goes wrong at some stage in the coding process, it may *never* be visible or even noticed in the final output, and this gives me reason to seriously distrust automated processes for analysis.

Ultimately, I decided I would double-check the plays I had deemed to be “Shakespeare’s”. Even though I hadn’t done much automated processing with the image files, I had assumed that the normalisation process would only change the file names to represent a modified version: so that plot10 would become plot0-10, plot11 would become plot0-11, plot234 would become plot0-234. I had assumed the information in these files wouldn’t change, and the names would correspond to the original files.

This was not true. Instead, I had isolated a very nice sample of 36 plays which I thought matched Shakespeare’s plays in numbering, but which turned out to be sampled from throughout the corpus. Matching the sampled “Shakespeare” concordance plots to the master document of concordance plots, I found that I had at least one Middleton play and at least one Seneca play in addition to some (but not all) Shakespeare plays.[1] At this point I was worried, so I re-created Shakespeare’s concordance plots from the master document of concordance plots. By redoing the concordance plots, I could guarantee that these were at least all Shakespeare’s plays in the first instance. Then I normalised them again for size, and went back to see what happened in that process. The first files were a perfect match, as I had hoped. But once I moved to the second concordance plot, I was in trouble.

Below is an image showing the unmodified concordance plot for The Taming of the Shrew (shx2), outlined in red and on the top left-hand side. The other eight concordance plots in this image are normalised for size, and even without great detail you can tell that none of these match the original file. You don’t even need to see the whole image to see this:

In other words, as I had suspected, the names of the normalised files didn’t correspond to the original file names, though they were all there.[2] More worryingly, I hadn’t caught it because I had assumed that the files were fine after running a process on them. The files produced results, and if I hadn’t double-checked (really, at this point, triple-checked), I wouldn’t have caught this discrepancy.
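One plausible source of this kind of silent shuffle (an assumption about what could go wrong in such a pipeline, not a diagnosis of my exact case) is that shell globs and many batch tools order files lexicographically rather than numerically, so plot10 sorts before plot2, and any step that renumbers its output in glob order will quietly permute the name-to-file mapping:

```python
import re

# Hypothetical file names in the style of the plot files described above.
files = ["plot2.jpg", "plot10.jpg", "plot1.jpg", "plot234.jpg"]

# Lexicographic order, which is what shell globs and default sorts use:
# plot10 lands before plot2, because "1" < "2" character by character.
print(sorted(files))

# Keying the sort on the embedded number restores the intended order:
numeric = sorted(files, key=lambda f: int(re.search(r"\d+", f).group()))
print(numeric)
```

Checking the mapping explicitly like this (or padding file names to plot001, plot002, …) is cheaper than re-deriving an entire sample by hand.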

So what do concordance plots for Shakespeare’s plays look like in composite for the vocatives attached to a name in a bigram (reminder of search terms: lord [A-Z]|sir [A-Z]|master [A-Z]|duke [A-Z]|earl [A-Z]|king [A-Z]|signior [A-Z]|lady [A-Z]|mistress [A-Z]|madam [A-Z]|queen [A-Z]|dame [A-Z])? Well, surprisingly, not so different from the sample curated previously, which may be less indicative of a specific authorial style:

Remember, we read these from left to right; now there’s a lot of use of vocatives in the very beginning of the plays, which stay quite strong near the rising action until there’s a relative absence just before and around the climax and the start of the falling action. Curiously, the heavy double hit || towards the end is still very visible, as well as a few more dark lines leading up to the conclusion. In some ways, the absence of these vocatives is almost more consistent, and therefore the white bits are more visible.

In the meantime I’m having a fascinating discussion with Lauren Ackerman about how to best address pixel density and depth of detail (especially in the larger EM play corpus), so maybe there will be a third instalment of concordance plots in the future.

[1] Seneca’s plays were published in the 1550s and 1560s, which is why they are included in this data set of printed plays in Early Modern London.

[2] The benefit of working with a smaller set like this is that there is a much smaller, finite number of texts to address: rather than n = 332 possible texts, I was now only looking at n = 36. So that was an improvement. In case you’re wondering what happened to one play (previously I had claimed there were 37 Shakespeare plays): one play doesn’t have any instances of the vocatives being addressed in a bigram with a capital letter.

What if you could take many concordance plots and layer them to get a composite view of many concordance plots in one image? I wanted to see if vocatives which mark for high-status individuals attached to a name appear in any particular pattern which resembles Freytag’s model of dramatic structure.[1]

I selected 12 vocatives which clearly illustrate social class attached to a word beginning with a capital letter for analysis, all of which are relatively frequent in the corpus of 332 plays comprising 7,305,366 words. In order to get my concordance plots for vocatives attached to a name, I used regular expressions searching for the vocative in question in a bigram with a capital letter, joined together by pipe characters, so the resulting search looked like this (signior is spelled incorrectly; this is the spelling which produced hits – I suspect something happened in the spelling normalisation stage): lord [A-Z]|sir [A-Z]|master [A-Z]|duke [A-Z]|earl [A-Z]|king [A-Z]|signior [A-Z]|lady [A-Z]|mistress [A-Z]|madam [A-Z]|queen [A-Z]|dame [A-Z]
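As a sketch of how this pattern behaves (using Python’s re module purely for illustration, on an invented sample line; AntConc’s regex syntax is broadly similar), note that each alternative matches the vocative plus the single capital letter that follows it:

```python
import re

# The twelve class-marking vocatives, each followed by a capital letter.
# "signior" is deliberately kept as the spelling that produced hits.
pattern = re.compile(
    r"lord [A-Z]|sir [A-Z]|master [A-Z]|duke [A-Z]|earl [A-Z]|king [A-Z]"
    r"|signior [A-Z]|lady [A-Z]|mistress [A-Z]|madam [A-Z]|queen [A-Z]|dame [A-Z]"
)

sample = "Good morrow, sir Toby. The queen I spoke of greets lady Macbeth."
hits = pattern.findall(sample)
print(hits)  # ['sir T', 'queen I', 'lady M']
```

The middle hit shows how a capitalized pronoun like "queen I" can slip through alongside genuine names.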

Although the regular expression I used picked up examples of queen I and the like, examples of a capital letter representing the start of a name were far more frequent overall. In the case of mistress, Alison Findlay’s definition (“usually a first name or surname, is a form of polite address to a married woman, or an unmarried woman or girl” (2010, 271)) accounts for its inclusion here. Though there are certainly complicated readings of this title, I consider instances of mistress to be at the very least a vocative relating to social class in Early Modern England.

The obvious solution to doing this kind of work is R, as people such as Douglas Duhaime and Ted Underwood have been making some gorgeous composite graphs with R for a number of years. To be honest, I didn’t really want to go through the process of addressing a corpus by writing an entire script to produce something that I know can be done quickly and easily in AntConc‘s concordance plot view: I had one specific need, and AntConc is an existing framework for producing concordance plots which are normalised for length, as well as a KWIC viewer and several other statistical analyses. I knew that if I wanted to check anything, I could do it easily. I didn’t feel any real need to reinvent the wheel by scripting to accomplish my task, unlike the general DIY process presented by R or Python.[2] The only real downside is that if you want to do more with the output, you have to move into another software package to do that, but even that is not the end of the world.

Ultimately, I wanted to take concordance plots for 332 plays and layer them into a composite picture of how they appear, rather than addressing them individually on a play-by-play basis. Layering images is a common way of examining edits in printed books; Chris Forster has done exactly that with magazine page sizes, and he suggested I use ImageMagick, a command-line tool for image compositing.[3] I have a similarly normalised view of my texts, as each concordance plot is normalised for length. Moreover, Chris and I are of the same mind about not introducing more complicated software for the sake of using software, so when he told me about this I was willing to give it a try, especially as he had successfully done exactly what I was trying to do. But first I needed concordance plots.

AntConc produces concordance plots but won’t export them, which is annoying but not as annoying as you may think. 38 screen grabs later, I had .pngs of each play’s concordance plot. Here they are in AntConc:

(If you’re not used to reading concordance plots: read them from left to right (from “start” to “finish”, in narrative terms); each | represents one hit, and the more hits there are close together, the darker the line looks.)

I turned these screenshots into one very large .jpg with the help of an open-source image editing program, just to have them all in one document together. The most well-known is probably GIMP, but both Lifehacker and Oliver Mason offer Seashore as a more Mac-friendly alternative to the GIMP.[4]
Then I broke the master document into individual concordance plots, sized 415×47, using Seashore’s really good select-copy-make-new-document-from-pasteboard option, which lets you keep moving the selection box around the master document, as seen below. So far I had used only regular expressions, command-shift-4, copy, paste, save as .jpg, and pen & paper to record what I was doing. Nothing complicated! It took a while, but in the process I got to know these results really well. Not all plays in the corpus contain all, or indeed any, of the vocatives: some plays use none of the twelve titles above, and so aren’t included in this output; others use only one, or some combination short of the full twelve.
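In code terms, that repeated crop is just rectangular slicing: a selection box (x, y, width, height) cut out of the master image. Here is a minimal Python sketch of the idea; the toy grid and the selection box values are invented, standing in for the real screenshot and the 415×47 plots.

```python
# Toy 6x8 "screenshot": each pixel value encodes its row and column,
# so we can check that the crop grabbed the right region.
master = [[col + 10 * row for col in range(8)] for row in range(6)]

# Hypothetical selection box: x, y of the top-left corner, then width and height.
x, y, w, h = 2, 1, 4, 3
crop = [row[x:x + w] for row in master[y:y + h]]

print(len(crop), len(crop[0]))  # 3 4  (h rows by w columns)
```

Seashore’s select box does the same thing interactively; the only difference is that I did it a few hundred times by hand.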

As a test, I separated out Shakespeare’s plays to see what a bunch of concordance plots looked like in composite. I opened a terminal and moved to the correct directory (six directories deep). Then, just in case, I normalised everything to the same size with mogrify -resize '415x47!' plot*.jpg, which resizes the images in place and forces the exact 415×47 dimensions.
I put those in a new folder of normalised images.
Then, from the directory of normalised images: convert plot*.jpg -evaluate-sequence mean average_page.jpg.
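ImageMagick’s -evaluate-sequence mean is nothing more exotic than a per-pixel average over the stack of images. A dependency-free Python sketch of the arithmetic, using toy single-row “plots” rather than real image data (0 standing for a black | tick, 255 for white background):

```python
# Three toy "plots", each a row of greyscale pixel values of equal length,
# standing in for equally sized 415x47 concordance plots.
plots = [
    [255, 0, 255, 255],
    [255, 0, 255, 0],
    [255, 255, 255, 255],
]

# Pixel-by-pixel mean across the stack: darker composite pixels mean
# more plots had a hit (a dark tick) at that position.
composite = [sum(px) / len(plots) for px in zip(*plots)]
print(composite)  # [255.0, 85.0, 255.0, 170.0]
```

A position that is dark in every plot averages to black; one that is dark in only a few plots averages to a light grey, which is exactly why the composite’s darker bands indicate shared structure.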

Here’s what 12 vocatives for social class in 37 Shakespeare plays look like in composite:
There are a few things that I notice in this plot: there is a quick use of naming vocatives near the beginning of the plays, a relative absence immediately after, but during the rising action and climax there are clear sections which use these vocatives quite heavily, especially in the build-up to the climax. Usage drops in the falling action until just before the denouement, where there is a point at which vocatives are used consistently and heavily, marked by || but surrounded by white on both sides. If you can’t see it, here is the concordance plot again, with that point highlighted in red.

If you repeat the above process for the 332 plays, you get the following composite image. Although the amount of information in some ways obfuscates what you’re trying to see, there are darker and lighter bits to this image.
Most notably, there is a similar cluster of class-status vocative use at the tail end of the introduction and into the rising action, a relative absence until the climax, and then the use of vocatives for social class seems to pick up towards the falling action and the end of the plays. Interestingly, the same kind of || notation is visible towards the conclusion, though here it appears twice. (Again, if you can’t see it, I’ve highlighted it in red here.)

Do vocatives attached to a name which mark for class status have recognizable patterns in dramatic structure? MAYBE.

[1] Matthew Jockers (1, 2) and Benjamin Schmidt have been doing interesting things with the computational analysis of dramatic structure. I’m not going anywhere near their level of engagement with dramatic arcs in this post, but they are interesting reads nonetheless. (Follow-up: Annie Swafford’s blog post on Jockers’ analyses is worth a read as well.)

[3] Okay, so this required a few more steps of code, most of which were install scripts requiring very little work on the human end beyond directions of the ‘type this, wait for the computer to finish’ variety. If you are on a Mac, you will need Xcode in order to install MacPorts in order to install ImageMagick, and then X11 to display output. X11 seems optional, especially if you keep a Finder window open nearby. Setting all this up took about two hours.

[4] It transpired that I could have done this with ImageMagick using the command convert plots*.png -append out.png.
Oh well. Seashore also offers layering capabilities for the more graphic-design-driven amongst you, but perhaps more importantly for me, it looks a lot like my dearly beloved MS Paint, a piece of software I’ve been trying to find a suitable replacement for since I joined The Cult of Mac in 2006.
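For the curious, what -append does can itself be sketched in a few lines of Python: vertical appending is just concatenating the rows of equally wide images. The two-row “plots” below are toy stand-ins for the screenshots.

```python
# Two toy 2x2 "images" as lists of pixel rows, equal in width.
plot_a = [[255, 0], [255, 255]]
plot_b = [[0, 0], [0, 255]]

# Vertical append: stack one image's rows below the other's.
stacked = plot_a + plot_b
print(len(stacked), len(stacked[0]))  # 4 2  (four rows tall, still two wide)
```

Which is to say: all that manual stitching in Seashore was, underneath, one list concatenation.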