Tuesday, August 15, 2017

On August 10, 2017, my partner Sara Carlstead Brumfield and I delivered this presentation at Digital Humanities 2017 in Montreal. The presentation was coauthored by Patrick Lewis, Whitney Smith, Tony Curtis, and Jeff Dycus, our collaborators at Kentucky Historical Society. This is a transcript of our talk, which has been very lightly edited. See also the Google Slides presentation and m4a and ogg audio files from the talk.

[Ben] We regret that our colleagues at
the Kentucky Historical Society are not able to be with us; as a
result, this presentation will probably skew towards the technical.
Whenever you see an unattributed quotation, that will be by our
colleagues at the Kentucky Historical Society.

The Civil War Governors of Kentucky
Digital Documentary Edition was conceived to address a problem in the
historical record of Civil War-era Kentucky that originates from the
conflict between the slave-holding, unionist elite and the federal
government. During the course of the war, they had fallen out
completely. As a result, at the end of the war the people who wrote
the histories of the war—even though they had been Unionists—ended
up wishing they had seceded, so they wrote these pro-Confederate
histories that biased the historical record. What this means is that
the secondary sources are these sort-of Lost Cause narratives that
don't reflect the lived experience of the people of Kentucky during
the Civil War. So in order to find out about that experience, we have to
go back to the primary sources.

The project was proposed about seven
years ago; editorial work began in 2012 – gathering the documents,
imaging them, and transcribing them in TEI-XML. In 2016, the Early
Access edition published ten thousand documents on an Omeka site,
discovery.civilwargovernors.org. Sara and I became involved around
that time for Phase 2.

The goal of Phase 2 was to publish 1500
heavily annotated documents that had already been published on the
Omeka site, and to identify people within them.

The corpus follows the official
correspondence of the Office of the Governor. As Kentucky was a
divided state, there were three Union governors during the Civil War,
and there were also two provisional Confederate governors.

Fundamentally, the documentary edition
is not about the governors. We want to look at the individual people
and their experience of war-time Kentucky through their
correspondence with the Office of the Governor. This correspondence
includes details of everyday life from raids to property damage, to –
all kinds of stuff: when people had problems, they wrote to the
governor when they didn't know where else to go.

If we're trying to highlight the people
within these documents, how do you do that within a documentary
edition? In a traditional digital edition, you use TEI. Each
individual entity that's recognized—whether it's a person, place,
organization, or geographic feature—will have an entry about it
created in the TEI header or some external authority file, and the
times that they are mentioned within the text will be marked up.
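
To make the markup concrete, here is a minimal sketch that parses a tiny TEI fragment and lists each tagged mention with the authority entry it points to. The fragment and the `#person-0001` / `#place-0001` identifiers are invented for illustration and are not taken from the edition; `geogName` is assumed here as the TEI element for geographic features.

```python
import xml.etree.ElementTree as ET

# Invented fragment: the sentence and the ref identifiers are placeholders.
SAMPLE = """<p xmlns="http://www.tei-c.org/ns/1.0">I write from
<placeName ref="#place-0001">Frankfort</placeName> on behalf of
<persName ref="#person-0001">Geo W. Johnson, Esq</persName>.</p>"""

NAME_TAGS = ("persName", "placeName", "orgName", "geogName")


def list_mentions(tei_xml):
    """Return (tag, ref, text) for every tagged name, in document order."""
    root = ET.fromstring(tei_xml)
    mentions = []
    for element in root.iter():
        # ElementTree reports namespaced tags as "{namespace}localname".
        local_name = element.tag.split("}")[-1]
        if local_name in NAME_TAGS:
            # Collapse line breaks and runs of spaces inside the mention.
            text = " ".join("".join(element.itertext()).split())
            mentions.append((local_name, element.get("ref"), text))
    return mentions
```

Calling `list_mentions(SAMPLE)` yields one tuple per mention, which is the link between a string in the text and an entity record that the rest of this talk is about.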

Now when done well, this approach is
unparalleled in its quality. When you get names that have correct
references within the ref attribute of their placeName tags --- you
really can't beat it. The problem with this approach is that it's
very labor-intensive, and because it's done before publication, it
adds an extra step before the readership can have access to the
documents.

The alternative approach which we've seen
in the digital humanities is the text mining approach, in which
existing documents have Named Entity Recognition and other machine
learning algorithms applied to them to attempt to find people who are
mentioned within the documents.

Here's an example, looking for places,
people, or concepts within a text.

The problem with this approach—while
it's not labor-intensive at all—is that it doesn't produce very
good quality. The third Union governor of Kentucky, Beriah Magoffin,
appears with one hundred and seven variants within the text that have
been annotated so far. These can be spelling variants, they can be
abbreviations, and then there are all these periphrastic expressions
like “your Excellency”, “Dear Sir”, “your predecessor”
(in a letter to his successor). So there are all kinds of variation in
the way this person appears [in the text].

Furthermore, even when you have
consistency in the reference, the referent itself may be different.
So, “his wife” appears in these documents as a reference to eight
different people. What are you going to do with that? No clustering
algorithm is going to figure out that “his wife” is one of these
eight people.

Our goal was to try to reproduce the
quality of the hand-encoded TEI-XML model in a less labor-intensive
way.

[Sara] So how do we do that? We built
a system, called Mashbill, for a cadre of eight graduate research assistants (GRAs), each assigned
150 documents from the corpus, who used a Chrome plug-in called
hypothes.is to highlight every entity in the published version of the
documents. So the documents are transcribed and published, and the
GRAs highlight every instance of an entity.

If we look at the second
[highlight] “Geo W. Johnson, Esq”, they highlight it, and then
they use Mashbill, where we use the hypothes.is API to pull in all of
their verbatim annotations.
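
As a rough sketch of that step (not Mashbill's actual code, which is a Ruby on Rails application), the following shows how the highlighted text could be pulled from the hypothes.is search API. It assumes the public `/api/search` endpoint, where each returned row carries a `TextQuoteSelector` whose `exact` field holds the highlighted string; the username and document URL passed in are placeholders, and private annotations would also need an API token, which is omitted here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://api.hypothes.is/api/search"


def build_search_url(username, document_url, limit=200):
    """Build a search URL for one annotator's highlights on one page."""
    query = urlencode({
        "user": f"acct:{username}@hypothes.is",
        "uri": document_url,
        "limit": limit,
    })
    return f"{SEARCH_URL}?{query}"


def verbatim_quotes(search_response):
    """Extract the highlighted strings from a parsed search response."""
    quotes = []
    for row in search_response.get("rows", []):
        for target in row.get("target", []):
            for selector in target.get("selector", []):
                if selector.get("type") == "TextQuoteSelector":
                    quotes.append(selector["exact"])
    return quotes


def fetch_quotes(username, document_url):
    """Network call: fetch and parse the highlights (public annotations only)."""
    with urlopen(build_search_url(username, document_url)) as response:
        return verbatim_quotes(json.load(response))
```

Keeping the parsing in `verbatim_quotes` separate from the network call means it can be exercised against a saved response without contacting the service.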

Each GRA sees the annotations they have
created, and [next to each] is an “identify” button. This pulls
the verbatim text into a database search using Postgres's trigram
library to look for closest matches within our database of known
entities.
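
To illustrate the idea behind the trigram search, here is a minimal pure-Python approximation of the similarity measure that Postgres's `pg_trgm` extension provides. It is a sketch for illustration, not the query Mashbill runs: it assumes the documented `pg_trgm` behaviour of lowercasing, padding each word with two leading spaces and one trailing space, and scoring by shared trigrams over total distinct trigrams. The candidate names in the usage note come from the example in this talk.

```python
import re


def trigrams(text):
    """Return the set of padded, lowercased trigrams for every word."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def closest_matches(query, known_names, limit=5):
    """Rank known entity names by similarity to the highlighted text."""
    scored = [(similarity(query, name), name) for name in known_names]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(name, round(score, 2)) for score, name in scored[:limit]]
```

Ranking “Geo W. Johnson, Esq” against “George Washington Johnson”, “George Johnston”, and “Beriah Magoffin” with this sketch puts the two Johnson/Johnston candidates close together and well ahead of the unrelated name, which is why a person still makes the final choice.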

“Geo W. Johnson, Esq” has the
potential to match a lot of people—mostly based on surname. It
looks like it might be the second one, George Johnston (a judge), but
probably it's George Washington Johnson—halfway down the page—who
was one of these provisional Confederate governors. The GRA would
choose that to associate the string with the entity in the database,
but if they couldn't find an entity—remember that the goal is to
find all the people in the corpus who are not already known to
historians—they have the ability to create an entity record.

When you create a new entity or when
you're working with an entity, we flesh out a lot of really rich
information about that entity within the tool. The GRAs would fill
in attributes from their research into a set of approved references
for Kentucky in this period, including dates, race, gender,
geographic location (latitude/longitude). We also get short
biographies which will be incorporated into the edition, and also a
list of documents [mentioning the entity].
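
One way to picture an entity record with these attributes is the sketch below. The field names, types, and the `select` helper are assumptions made for illustration; they are not Mashbill's actual database schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    """Hypothetical entity record; field names are illustrative only."""
    name: str
    entity_type: str  # person, place, organization, or geographic feature
    gender: Optional[str] = None
    race: Optional[str] = None
    birth_date: Optional[str] = None
    death_date: Optional[str] = None
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    biography: str = ""
    document_ids: List[str] = field(default_factory=list)


def select(entities, **criteria):
    """Return the entities whose attributes equal every given criterion."""
    return [
        entity for entity in entities
        if all(getattr(entity, key) == value for key, value in criteria.items())
    ]
```

With records like these, a reader-facing filter such as `select(entities, entity_type="person", gender="female")` becomes a one-line query.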

Once you have the information, you can
do a lot with the entities really quickly. We can do rich entity
visualizations: the big dot is people, places, organizations and
geographic features; you can look at gender of entities within the
corpus; we can look at entities that appear more often than others
and who they are. You can do a lot of high-value work with the data.

We can also look at documents and the
places that they mention – large dots are places that are mentioned
more often in the documents.

[Ben] The last stage of this, is—once
the entity research is finished and once the annotations for the
document have all been identified—the Mashbill system will produce
a TEI-XML file for every entity. It will also update the existing
TEI documents that were created during the transcription process with
the appropriate persName, placeName, orgName tags with references to
[the entity files]. It will also automatically check those files
into Github so that the Github browser interfaces will display the
differences between [the versions].
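
The tagging step can be sketched as follows. This is a deliberately simplified illustration, not the Mashbill implementation: it works on a plain-text string rather than an existing TEI document, tags only the first occurrence of each highlighted string, and uses invented `ref` values.

```python
from xml.sax.saxutils import escape, quoteattr


def tag_mentions(plain_text, mentions):
    """Wrap each identified mention in a TEI name tag with a ref attribute.

    mentions is a list of (verbatim_text, tag_name, entity_ref) tuples.
    """
    spans = []
    for verbatim, tag, ref in mentions:
        start = plain_text.find(verbatim)
        if start != -1:
            spans.append((start, start + len(verbatim), tag, ref))
    spans.sort()

    pieces, cursor = [], 0
    for start, end, tag, ref in spans:
        if start < cursor:
            continue  # skip a mention that overlaps one already tagged
        pieces.append(escape(plain_text[cursor:start]))
        pieces.append(
            f"<{tag} ref={quoteattr(ref)}>{escape(plain_text[start:end])}</{tag}>"
        )
        cursor = end
    pieces.append(escape(plain_text[cursor:]))
    return "".join(pieces)
```

A production version would need to anchor each annotation to its exact position in the transcription, since the same string can occur more than once and refer to different entities.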

So we end up with an output that is
equivalent to a hand-coded digital edition which is P5-compliant TEI,
but which we hope takes a little bit less labor.

If we're trying to look at
relationships between people in this corpus, we need to define those
relationships. One traditional method, which we saw earlier in [François Dominic Laramée's presentation, "La Production de l’Espace dans l’Imprimé Français d’Ancien Régime : Le Cas de la Gazette"], is co-occurrence: trying to identify entities that are
mentioned within a block of text. Maybe [that block] is a page,
maybe it's a paragraph, maybe it's a sentence or a word window.

But co-occurrence has a lot of
challenges. For example, [pointing] if we look right here, “our Sheriff” (who is identified as Reuben Jones, I think) is mentioned
within the same paragraph as these other names. But the reason he's
mentioned is – it's just an aside: we sent a letter via our
sheriff, now we're going to talk about these county officers.
There's no relationship between the sheriff, Reuben Jones, and the
officers of the county court. The only relationship that we know of
is between Reuben Jones and the letter writer – and that's it.
Co-occurrence would be completely misleading here.
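
A minimal sketch of block-level co-occurrence counting shows how the spurious link arises. The function treats every pair of entities named in the same block as related; the officer names in the usage note are placeholders, since the actual names are not given here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(blocks):
    """Count how often each pair of entities shares a block of text.

    blocks is a list of lists: the entities named in each page,
    paragraph, sentence, or word window.
    """
    counts = Counter()
    for entities in blocks:
        # Sort so each pair has one canonical order; dedupe repeat mentions.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts
```

Given one paragraph naming “Reuben Jones”, “County Officer A”, and “County Officer B”, the counter links the sheriff to both officers exactly as strongly as it links the officers to each other, even though the text attests no relationship between him and them.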

[Sara] So what do we do instead?

Once you've identified the entities
within the text, the next step in the Mashbill pipeline is to define
the relationship that you're seeing. Those might be relationships
that are attested to by the document itself, or they might be
relationships that the GRAs found over the course of their research
for the biographies.

Mashbill displays a list of all the
entities that appear in a document, and the GRAs choose relationships
for those entities based on their research. We have six different
types of relationships—social, legal, political, slavery,
military—and we also prompt the GRAs, showing them what we already
know about the relationships of the [entities mentioned within a
document].
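
A typed relationship can be sketched as a small record plus two helpers: one that groups edges by type (for example, to colour a network graph) and one that recalls what is already known about the entities in a given document. The class, field, and function names are hypothetical, not Mashbill's schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Relationship:
    """Hypothetical typed edge between two entities."""
    source: str
    target: str
    kind: str  # e.g. social, legal, political, slavery, military
    document_id: Optional[str] = None  # None if found through outside research


def edges_by_kind(relationships):
    """Group (source, target) pairs by relationship type."""
    grouped = defaultdict(list)
    for rel in relationships:
        grouped[rel.kind].append((rel.source, rel.target))
    return dict(grouped)


def known_relationships(relationships, entities_in_document):
    """Return existing relationships whose two ends both appear in a document."""
    present = set(entities_in_document)
    return [
        rel for rel in relationships
        if rel.source in present and rel.target in present
    ]
```

Unlike a co-occurrence count, each edge here records a human judgment about what kind of relationship exists and, optionally, which document attests it.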

So we have richer relationship data
than a lot of traditional computational approaches, which means that
you can do visualizations which have more data encoded within them,
and can be more interesting.

This is Caroline Dennett, who was an enslaved woman who was brought [to Kentucky]
as contraband with the Union Army, was “employed” by a family in
Louisville, and was accused of poisoning their eighteen month old
daughter. There are a lot of documents about her, because there are
people writing to the governor about pardoning her, or attesting to
her character (or lack of ability to do anything that horrible).

What we did with our network is not
just Caroline and all the people and organizations she was related
to, but rather we have different types of relationships. We have
legal relationships, political relationships; we have social
relationships. So a preacher in her town was one of the people who
wrote to the governor on her behalf, so we show a social relationship
with that person. We have about three different types [of
relationships] displayed in different colors on this graph.

What are our results?

As of a week ago, this project had
annotated 1228 documents with 15931 annotations. Of those
annotations, 14470 have been identified as 8086 particular entities.
On our right [pointing], we have the distribution of annotations on
documents: some of them, like petitions, have as many as 238 names,
but our median is around eight entities named per document.

You can find the project at
civilwargovernors.org. That's the Early Access version which is just
the transcriptions; by October those will be republished with all the
biography data and the links between the documents and the entity
biographies.

The software is on Github. I'm Sara
Brumfield, this is Ben Brumfield; we're with Brumfield Labs. Patrick
Lewis is the PI on this project, Whitney Smith, Tony Curtis, and Jeff
Dycus are editors and technologists at the Kentucky Historical
Society. We also want to thank the graduate research assistants.

[applause]

Many questions were very faint in
the audio recording; as a result, the following question texts should
be regarded as paraphrase rather than transcripts.

Question: You mentioned the project's
goals of trying to get beyond a pro-slavery, pro-Confederate historical
record. Do you have an idea of how that's going?

Answer: [Ben] What we find is that the
documents skew male; they skew white. So it's not like we can create
documents that don't exist. But what we can do now is identify
documents and people, so you can say “Show me all the women of
color who are mentioned within the documents; I want to read about
them.” So at least you can find them.

Question: Despite the workflow and
process, it seems like there are still a lot of hours of labor
involved in this. Can you give us an idea of the amount of labor
involved in this project, outside of building the software?

Answer: [Sara] The budget for the labor
was $40,000, which hired eight GRAs for the summer. [Ben] They're
not done yet, but we think they will achieve the goal of 15,000
entities. It's hard to tell the difference between this and a TEI
tagging project, in part because—in addition to identifying
entities—every single entity had to be researched, and a biography
had to be written for them if possible. [Sara] That's obviously
labor-intensive. From a software perspective, we tried to think
really hard about how to make this work go faster. So using
hypothes.is for annotation: hypothes.is is really slick, and we also
didn't have to build an annotator, so that keeps your costs of
software development down. So that went really fast. Trying to
match entities to choose; we tried to do a lot of that sort of work
to make the GRAs as effective as possible. [Ben] But they still have
to do the research; they still have to read the documents.

Question: All of your TEI examples
focus on places – were you able to handle other kinds of entities?

Answer: [Ben] We concentrated on
people, places, and organizations, but one interesting thing about
this approach is that—if you look up here at entities mentioned
more than ten times, and I'm sorry there's no label—the largest red
blob and the largest blue blob are both Kentucky. One of them is the
Government of Kentucky; the other is Kentucky as a place. Again,
humans can differentiate that in a way that computers can't. [Sara]
We did organizations, people, places, and geographic features.

Question: This is a fantastic resource
for not just Kentucky Historical Society, but also in terms of
thinking through history in the US. I was wondering what your data
plan was, and how available and malleable is the data that you
produce.

Answer: [Sara] The data itself is
flushed to Github as TEI documents, so every entity will have a
document there, as well as every document. The database itself is
not published anywhere. [Ben] Our goal with this was that, by the time we got
to the “pencils down” phase of the project, everything was
interoperable, in Github, so that people could reconstruct the
project from that, and that no information was lost – but that's
the extent of it.

Question: A technical question – I
missed the part with Github. How does that work?

Answer: [Ben] So the editors were looking for
a way of exposing the TEI for reuse by other people. Doing all this
work on TEI, then locking it away behind HTML is no fun. That said,
they were not that happy with—and they loved the idea of Github as
a repository; we had used it before for the Stephen F. Austin papers
as a raw publication venue—they were really not comfortable with
their graduate research assistants having to figure out how git
works, and how do you resolve merge conflicts, and such. As a
result, Mashbill—the Ruby on Rails application that we built—every
time there's a change to a document or an entity, it does a checkout
and merge, finds the TEI, adds the tags—essentially merges all that
data in—and then checks that back into Github. As a result [the
GRAs] are able to use the Github web interface to see the diffs and
publish the data, but they didn't have to actually touch git. [Sara]
Right, but the editors might, if they need to.
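
The automated check-in can be pictured with the sketch below. It is written in Python for illustration, whereas Mashbill itself is a Ruby on Rails application, and the exact sequence of git commands, the branch name, and the file paths are assumptions rather than a description of the real code.

```python
import subprocess


def git_publish_commands(paths, message, branch="master"):
    """Return the git commands for one automated check-in (assumed sequence)."""
    return [
        ["git", "pull", "origin", branch],   # fetch and merge upstream edits
        ["git", "add", "--", *paths],        # stage the regenerated TEI files
        ["git", "commit", "-m", message],
        ["git", "push", "origin", branch],
    ]


def publish(repo_dir, paths, message, branch="master", run=subprocess.run):
    """Run each command in the repository, stopping at the first failure."""
    for command in git_publish_commands(paths, message, branch):
        run(command, cwd=repo_dir, check=True)
```

Passing `run` as a parameter lets the sequence be checked with a stand-in function before it is pointed at a real repository.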

About us: Brumfield Labs, LLC is a software consultancy specializing in digital editions and adjacent methodologies like crowdsourced transcription, image processing/IIIF, and text mining. If you have a project you'd like to discuss, or just want to pick our brains, we'd love to talk to you. Just send a note to benwbrum@gmail.com or saracarl@gmail.com and we'll chat.