Saturday, July 13, 2013

The Collaborative Future of Amateur Editions

I'm Ben Brumfield. I'm not a scholarly editor; I'm an amateur editor and professional software developer. Most of the talks I give are about crowdsourcing -- crowdsourced manuscript transcription and how to get people involved. I'm not talking about that today -- I'm here to talk about amateur editions.

So let's talk about the state of amateur editions as it was, as it is
now, as it may be, and how that relates to the people in this room.

Let's start with a quote from the past. This was written in 1996, representing what I think may be a familiar sort of consensus [scholarly] opinion about the quality of amateur editions, which can be summed up in the word "ewww!"

So what's going on now? Before I start looking at individual examples of amateur editions, let's define--for the purpose of this talk--what an amateur edition is.

Ordinarily, people will be talking about one of three different things:

They can be talking about projects like Paul's, in which you have an institution that organizes and runs the project, but all the transcription, editing, and annotation is done by members of the public.

Or, they can be talking about organizations like FreeREG, a client of mine: a genealogy organization in the UK that is transcribing all the parish registers of baptisms, marriages, and burials from the Reformation up to 1837. In that case, all the material--all the documents--are held at local records offices and archives, which in many cases are quite hostile to the volunteer attempt to put these things online. Nevertheless, over the last fifteen years, they've managed to transcribe twenty-four million of these records, and are still going strong.

Finally, they can be talking about amateur-run editions of amateur-held documents. These are cases like me working on my great-great grandmother's diaries, which is what got me into this world [of editing].

I'm going to limit that [definition] slightly and get rid of crowdsourcing. That's not what I want to talk about right now. I don't want to talk about projects that have the guiding hand of an institutional authority, whether that's an archive or a [scholarly] editor.

So let's take a look at amateur editions. Here's a site called Soldier Studies. Soldier Studies is entirely amateur-run. It's organized by a high-school history teacher who got really involved in trying to rescue documents from the ephemera trade.

The sources of the transcripts of correspondence from the American Civil War are documents being sold on eBay. He sees the documents that are passing through--and many of them he recognizes as important, as an amateur military historian--and he says, I can't purchase all of these, and I don't belong to an institution that can purchase them. Furthermore, I'm not sure that it's ethical to deal in this ephemera trade--there is some correlation to the antiquities trade--but wouldn't it be great if we could transcribe the documents themselves and just save those, so that as they pass from a vendor to a collector, some of the rest of us can read what's on these documents?

So he set up this site in which users who have access to these transcripts can upload letters. They upload these transcripts, and there's some basic metadata about locations and subjects that makes the whole thing searchable.

But the things that I think people in here--and I myself--will be critical about are the transcription conventions that he chose, which are essentially none. He says, correspondence can be entered as-is--so maybe you want to do a verbatim transcript, but maybe not--and the search engines will be able to handle it.

A little bit more shocking is that -- you know, he's dealing with people who have scans--they have facsimile images--so he says, we're going to use that. Send us the first page, so that we know that you're not making this piece of correspondence up completely, fabricating it out of whole cloth.

So that's not a facsimile edition, and we don't have transcription conventions. He has this caveat, in which he explains that this [site] is reliable because we have "the first page of the document attached to the text transcription as verification that it was transcribed from that source." So you'll be able to read one page of facsimile for each transcript. We do our best; we're confident, so use them with confidence--but we can't guarantee that things are going to be transcribed validly.

Okay, so how much use is that to a researcher?

This puts me in mind of Peter Shillingsburg's "Dank Cellar of Electronic Texts", in which he talks about the world "being overwhelmed by texts of unknown provenance, with unknown corruptions, representing unidentified or misidentified versions."

He's talking about things like Project Gutenberg, but that's pretty much what we're dealing with right here. How much confidence could a historian place in the material on this site? I'm not sure.

Here's an example of an amateur edition which is in a noble cause, but which is really more ammunition for the earlier quote.

So what about amateur editions that are done well? This is the Papa's Diary Project: an edition of the 1924 diary of a Jewish immigrant to New York, transcribed by his grandson.

What's interesting about this -- he's just using Blogger, but he's doing a very effective job of communicating to his reader:

So here is a six-word entry. We have the facsimile--we can compare and tell [the transcript] is right: "At Kessler's Theater. Enjoyed Kreuzer Sonata."

So the amateur who's putting this up goes through and explains what Kessler's theater is, who Kessler was.

Later on down in that entry, he explains that Kessler himself died, and the Kreuzer Sonata is what he died listening to. Further down the page you can listen to the Kreuzer Sonata yourself.

So he's taken this six-word diary entry and turned it into something that's fascinating, compelling reading. It was picked up by the New York Times at one point, because people got really excited about this.

Another thing that amateurs do well is collaborate. Again: Papa's Diary Project. Here is an entry in which the diarist transcribed a poem called "Light".

Here in the comments to that entry, we see that Jerroleen Sorrensen has volunteered: Here's where you can find [the poem] in this [contemporary] anthology, and, by the way, the title of the poem is not "Light", but "The Night Has a Thousand Eyes".

So we have people in the comments who are going off and doing research and contributing.

I've seen this myself. When I first started work on FromThePage, my own crowdsourced transcription tool, I invited friends of mine to do beta testing.

I started off with an edition that I was creating based on an amateur print edition of the same diary from fifteen years previously.

If you look at this note here, what you see is Bryan Galloway looking over the facsimile and seeing this strange "Miss Smith sent the drugg... something" and correcting the transcript--which originally said "drugs"--saying, Well, actually that word might be "drugget", and "drugget", if you look on Wikipedia, is a coarse woolen fabric. Which--since it's January and they're working with [tobacco] plant-beds--is probably what it is.

Well, I had no idea--nobody who's read this had any idea--but here's somebody who's going through and doing this proofreading, and he's doing research and correcting the transcription and annotating at the same time.

Another thing that volunteers do well is translate. This is the Kriegstagebuch von Dieter Finzen--the war diary of Dieter Finzen, a soldier in World War I who was then drafted again in World War II. This [edition] is being run by a group of volunteers, primarily in Germany.

What I want to point out is that here is the entry for New Year's Day, 1916. They originally post the German, and then they have volunteers who go online and translate the entry into English, French, and Italian.

So now, even though my German is not so hot, I can tell that they were stuck drinking grenade water.

So, what's the difference?

What's the difference between things that amateurs seem to be doing poorly, and things that they're doing well?

This [awareness of editorial standards] is something that amateurs, in many cases, are not concerned with -- don't know exists -- maybe have never even been exposed to.

So, based on that, let's talk about the future.

How can we get amateurs--doing amateur editions on their own--to move from this mix of things done well and things done poorly to doing everything well that's relevant to researchers' needs?

I see three major challenges to high-quality amateur editions.

The first one is one which I really want to involve this community in, which is ignorance of standards. The idea that you might actually include facsimiles of every page with your transcription -- that's a standard. I'm not talking about standards like TEI -- I'd love for amateur editions to be elevated to the level print editions had reached by 1950 -- we're just talking about some basics here.

The second and third are lack of community and lack of a platform.

So let's talk about standards.

How does an amateur learn about editorial methodologies? How do they learn about emendations? How do they learn about these kinds of things?

Well, how do they learn about any other subject? How do they learn about dendrochronology if they're interested in measuring tree rings?

Wikipedia!

Let's go check out Wikipedia!

Wikipedia has a problem for most subjects, which is that Wikipedia is filled with jargon. If you look up dendrochronology, you don't really have a starting place, a "how to". If you look up the letter X, you get this wonderful description of how 'X' works in Catalan orthography, but it presupposes that you're familiar with the International Phonetic Alphabet, and that you know that the thing which looks like an integral sign is actually the 'sh' sound.

Now if amateurs are trying to do research on scholarly editing and documentary editing in Wikipedia, they have a different problem:

So if they can't find the material online that helps them understand how to encode and transcribe texts, where are they going to get it?

Well--going back to crowdsourcing--one answer is participation in crowdsourcing projects. Crowdsourcing projects--yes, they are a source of labor; yes, they are a way to do outreach about your material--but they are also a way to train the public in editing. And they are training the public in editing whether that's the goal of the transcription project or not. The problem is that the teacher in this school is the transcription software--the transcription website.

This means that the people who are teaching the public about transcription--the people who are teaching the public about editing--are people like me: developers.

So, how do developers learn about transcription?

Well, sometimes, as Paul [Flemons] mentioned, we just wing it. If we're lucky, we find out about TEI, and we read the TEI Guidelines, and we discover that there's so much editorial practice encoded in the TEI Guidelines that they're a huge resource.
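To make that concrete: here's a minimal sketch--my own illustrative example, not drawn from any project named in this talk--of the kind of editorial practice the TEI Guidelines encode, marking up a hypothetical diary line with standard TEI P5 transcription elements:

```xml
<!-- A hypothetical diary line, using common TEI P5 elements
     for deletions, insertions, uncertain readings, and
     editorial corrections: -->
<p>
  Miss Smith sent the
  <del rend="strikethrough">drugs</del>
  <add place="above">
    <unclear reason="faded">drugget</unclear>
  </add>
  for the plant-beds
  <choice>
    <sic>to day</sic>
    <corr>today</corr>
  </choice>.
</p>
```

Each of these elements--del, add, unclear, choice with sic and corr--embodies an editorial decision about what to emend, what to flag as doubtful, and how to record the document's physical state: exactly the accumulated practice a developer stumbles into when reading the Guidelines.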

If we happen to know the people in this room or the people who are meeting at the Association for Documentary Editing in Ann Arbor, we might discover traditional editorial resources like the Guide to Documentary Editing. But that requires knowing that there's a term "Documentary Editing".

So what does that mean? What that means is that people like me--developers with my level of knowledge or ignorance--are having a tremendous amount of influence on what the public is learning about editing. And that influence does not just extend to projects that I run -- that influence extends to projects that archives and other institutions using my software run. Because if an archive is trying to start a transcription project, and the archivist has no experience with scholarly editing, I say, You should pick some transcription conventions. You should decide how to encode this. Their response is, What do you think? We've never done this before. So I'm finding myself giving advice on editing.

Okay, moving on.

The other thing that amateurs need is community.

Community is important because community allows you to collaborate. Communities evaluate each [member's] work and say, This is good. This is bad. Communities teach each [member]. And communities create standards -- you don't just hang out on Flickr to share your photos -- you hang out on Flickr to learn to be a better photographer. People there will tell you how to be a better photographer.

We have no amateur editing community for people who happen to have an attic full of documents and want to know what to do with them.

So communities create standards, and we know this. Let me quote my esteemed co-panelist, Melissa Terras, who, in her interviews with the managers of online museum collections--non-institutional online "museums"--found that people are coming up with "intuitive metadata" standards of their own, without any knowledge of or reference to existing procedures for creating traditional archival metadata.

The last big problem is that there's currently no platform for someone who has an attic full of documents that they want to edit. They can upload their scans to Flickr, but Flickr is a terrible platform for transcription.

There's no platform that will guide them through best practices of editing.

What's worse, if there were one, it would need a "killer feature", which is what Julia Flanders describes in the TAPAS project as a compelling reason for people to contribute their transcripts and do their editing on a platform that enforces rigor and has some level of permanence to it -- rather than just slapping their transcripts up on a blog.

So, let's talk about the future. In his proposal for this conference, Peter Robinson describes a utopia and dystopia: utopia in which textual scholars train the world in how to read documents, and a dystopia in which hordes of "well-meaning but ill-informed enthusiasts will strew the web willy-nilly with error-filled transcripts and annotations, burying good scholarship in rubbish."

This is what I think is the road to dystopia:

Crowdsourcing tools ignore documentary editing methodologies. If you're transcribing using the Transcribe Bentham tool, you learn about TEI. You learn from a good school. But almost all of the other crowdsourced transcription tools don't have that. Many of them don't even contain a place for the administrator to specify transcription conventions to their users!

As a result, the world remains ignorant of the work of scholarly editors, because we're not finding you online--because you're invisible on Wikipedia--and we're not going to learn about your work through crowdsourcing.

So the public gets this attitude that, well, editing is easy -- type what you see. Who needs an expert? I think that's a little bit worrisome.

The final thing--which, when I started working on this talk, was a sort of wild bogeyman--is the idea that new standards come into being without any reference whatsoever to the tradition of scholarly or documentary editing.

I thought that [idea] was kind of wild. But, in March, an organization called the Family History Information Standards Organization--which is backed by Ancestry.com, the Federation of Genealogical Societies, BrightSolid, and a number of other organizations--announced a Call for Papers for standards for genealogists and family historians to use -- sometimes for representing family trees, sometimes for source documents.

Here we have what looks like a fairly traditional print notation. It's probably okay.

What's a little bit more interesting, though, is the bibliography.

Where is your work in this bibliography? It's not there. Where is the Guide to Documentary Editing? It's not there.

So here's a new standard that was proposed the month before last. Now, I hope to respond to this--when I get the time--and suggest a few things that I've learned from people like you. But these standards are forming, and these standards may become what the public thinks of as standards for editing.

All right, so let's talk about the road to utopia.

The road to the utopia that Peter described I see as in part through partnerships between amateurs and professionals: you get amateurs participating in projects that are well run -- that teach them useful things about editing and how to encode manuscripts.

Similarly, you get professionals participating in the public conversation, so that your methodologies are visible. Certainly your editions are visible, but that doesn't mean that editing is visible. So maybe someone here wants to respond to that FHISO request, or maybe they just want to release guides to editing as Open Access.

As a result, amateurs produce higher-quality editions on their own, so that they're more useful for other researchers; so that they're verifiable.

And then, amateurs themselves become advocates -- not just for their material and the materials they're working on through crowdsourcing projects, but for editing as a discipline.

So that's what I think is the road to utopia.

So what about the past?

Back in Shillingsburg's "Dank Cellar" paper, he describes the problems with the e-texts he's seeing, and he really encourages scholarly editors not to worry about them -- to disengage -- [and] instead to focus on coming up with methodologies--and again, this is 2006--for creating digital editions. He says that these aren't well understood yet. Let's not get distracted by these [amateur] things -- let's focus on what's involved in making and distributing digital editions.

The way I've learned about documentary editing, and I suspect lots of other empirical historians have too, is by studying existing editions of sources. We have to look at them and understand them to do our research, and that means taking notice of their transcription conventions. If a document is particularly important, or we're not entirely sure we can trust the print edition, we might well end up comparing it to the original manuscript, which is easier now that we have digital cameras. Sometimes we need to do this because a reading seems unlikely in the context of everything else we've found in our research.

This is all part of not trusting anyone entirely, which is important at the highest levels of original research. I'm often concerned that genealogists and amateur military historians trust their sources too much -- it must be true because it says so -- and don't always even appreciate the importance of tracing information back to the source -- it's just 'known'. Many of them are better than that, but if they're not critical enough of their sources, then they don't notice things like transcription conventions or editorial intervention.

The other point is, I'm surprised that crowdsourcing projects are mainly managed by developers, but maybe I shouldn't be. Ideally there should be at least two people responsible for a project, with a clear separation between them: the IT person, who develops and fixes the software and manages the server, and the editorial person, who decides transcription and markup conventions, checks the quality of transcripts, trains the transcribers, and so on. When I launch my new transcription business, I intend to offer the second kind of service, if there's any demand for it. It looks like there is a need, but the question is: do projects recognise that they need this kind of help, and are they willing and able to pay a professional?

Regarding crowdsourcing projects, the ones I'm thinking of are those launched by libraries, archives, and museums. In those cases, the most common goal is to improve their collections' findability and to engage with their patrons, rather than to produce the kind of editions you'd want to print out and study. Look at institutions using Scripto or the NARA Drupal transcription module for examples of this. On the other hand, the Papyrological Editor, T-PEN and Transcribe Bentham are used by traditional editorial projects, but are far less visible. (FromThePage is used for both, though the projects emphasizing mark-up tend to be behind intranets.)

One of the challenges of the open-source model of distribution for these tools is that, while the tool may be developed within the kind of partnership you describe, it may be installed elsewhere without that environment.

I'll be very interested to follow your experience offering transcription expertise, and wish you the best of luck.

Ben, I agree that the difference between a professional and an amateur is the awareness of standards. These standards are usually domain-specific. Technical standards, such as TEI, should follow and support domain-specific research standards.

Does adherence to standards mean that professionals never make mistakes? No. But it does mean that the research published should be more transparent and reproducible. Holes in reasoning and logic are more obvious when research standards are used. The problem with a lot of amateur research is not just that it is of poor quality; the problem is that it is opaque, and cannot be built upon by others. It's hard to tell which parts are of poor quality, and which parts, if any, are of worth. This kills collaboration.

As you mentioned, Ben, software developers are the teachers of amateur researchers. Yet it seems that many developers, especially those writing software for the family history domain, are oblivious to research standards -- or at least they do not write software that is built for standards-driven research.

Is it that they are afraid their users cannot comprehend these standards? Do they think they will sell more software to unsuspecting amateurs by dumbing down genealogical research to meaningless trees of names and dates? If so, they are doing a great disservice to the research community.

I believe that most normal people can understand and can even do research that meets high-quality research standards. Software can be written that is research standards aware. Software can be written to make research more transparent. Software can be written that allows research to speak for itself. Software can be written that teaches amateurs what good research looks like. Software can be written that allows professionals and amateurs to collaborate. Is this an easy thing for developers to produce? No. To write software like this, developers need to have at least a solid understanding of these professional research standards. Experience producing high-quality research would help as well.

When it comes down to it, why would anyone want to spend time and money to produce and publish garbage? It only makes sense, even for a beginner, to learn how to do research right. Software can and should make this learning process as seamless as possible.

Really interesting paper! The Association for Documentary Editing has been looking into reaching out to amateur editors and working with them, but from our perspective it is often hard to find them. I think that it would be a great thing to work out what kinds of advice amateur editors are looking for. The ADE runs an annual Institute for Editing Historical Documents where solo editors can come and get hands-on experience and also attend the annual meeting. The ADE's website is www.documentaryediting.org

This talk may have become obsolete within a week of its delivery, as the ADE's release of the Guide to Documentary Editing as an Open Access publication has been an enormous and positive step. Similarly, the comments by Drew and Chris make me think that amateurs are enthusiastic about engaging with professional editors.

As you point out, it's still very difficult to find those amateurs due to the platform and community issues I mentioned above.

I do need to update the blog post to reflect this, as well as perhaps write a new, short post on resources for amateur editors.