Q: Could you give us a quick overview of the project? What were you hoping to get out of crowd-sourcing transcription?

In anticipation of the 2011 Civil War Sesquicentennial celebration, the University of Iowa Libraries reformatted all of our Civil War manuscript holdings, which totaled about 13,000 digital objects. As that project was winding down, we began talking about ways to promote use of the digital collection. Given the amount of handwritten material, we were also interested in making the collection full-text searchable.

We had been admiring efforts such as Zooniverse, where “citizen scientists” were transcribing ship logs, Hubble Space Telescope data and other archival materials to make them machine-readable and available for research. So when the idea of transcribing the Civil War diaries first emerged, we started looking into what it would take to make that happen. We were quickly discouraged. We didn’t have the programming capacity to develop a solution of our own or to implement From the Page, which seemed to be the only potentially viable tool out there when we started looking. Our small group of programmers was already overtaxed. We were going to let go of the idea until our webmaster, Linda Roth, suggested we go the low-tech route and simply collect submissions through a web form. We took her advice and launched this past spring.

Q: In your talk you described a somewhat ingenious, albeit rather low-tech, workflow for the project. Could you give us a run-through of what that workflow was like and how you decided on it?

Our metadata librarian, Jen Wolfe, created the stick-figure diagrams below. They present a hilariously simple yet accurate accounting of our end-to-end workflow, both when it works well and when it breaks down. To make the interface, Linda wrote some simple PHP code to generate a web page that pulls diary pages from the system we use to host the digital assets. A digitized page image is displayed alongside a text box so users can type in the transcription. When transcriptionists hit ‘submit,’ an email message with the text arrives in our departmental inbox. Staff members review the submission and paste it into the system metadata record. Once the collection is indexed, the transcription is live and the text is searchable.
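The steps just described can be sketched in a few lines of code. This is only an illustration, written in Python rather than the PHP Linda used, and it is not the Libraries' actual code: the names `Submission`, `build_email`, `MetadataStore` and the address `transcripts@example.org` are all hypothetical, and the in-memory store merely stands in for the digital library's metadata records and index.

```python
from dataclasses import dataclass, field
from email.message import EmailMessage


@dataclass
class Submission:
    page_id: str      # hypothetical identifier for a digitized diary page
    text: str         # transcription typed into the text box
    name: str = ""    # optional; contributors need not identify themselves


def build_email(sub: Submission, inbox: str = "transcripts@example.org") -> EmailMessage:
    """Turn a web-form submission into the email that staff will review."""
    if not sub.text.strip():
        # The only requirement is that the text box is not empty.
        raise ValueError("transcription text is required")
    msg = EmailMessage()
    msg["To"] = inbox
    msg["Subject"] = f"Transcription for page {sub.page_id}"
    msg.set_content(f"From: {sub.name or 'anonymous'}\n\n{sub.text}")
    return msg


@dataclass
class MetadataStore:
    """Stand-in for the digital library's metadata records (an assumption)."""
    records: dict = field(default_factory=dict)

    def paste_transcription(self, page_id: str, text: str) -> None:
        # Staff review the email, then paste the text into the page's record.
        self.records[page_id] = text

    def search(self, term: str) -> list:
        # Naive full-text search standing in for the collection index.
        return sorted(p for p, t in self.records.items() if term.lower() in t.lower())
```

In this sketch the human review step sits between `build_email` and `paste_transcription`: nothing reaches the searchable records until a staff member has read the message and pasted the text in.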

The ideal crowdsourcing transcription workflow…

Such a heavily mediated system is more labor-intensive and results in an asynchronous submission process. One significant drawback is that multiple users can work on the same page simultaneously, resulting in duplication of effort. The fact that the system relies so heavily on “peopleware” instead of software can make us feel self-conscious, especially compared to some of the well-heeled efforts out there. But the reality for us is that we have far more staff able to process the transcriptions than we have programmers able to customize software. So, in the end, we decided it was worth it if it meant we could attract a lot of new users and get them to engage with the materials.
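To make the duplication drawback concrete: because nothing locks a page while someone is working on it, the same page can turn up in the inbox more than once. A minimal sketch of how staff might spot such cases, assuming each submission can be reduced to a hypothetical `(page_id, text)` pair, is below. The function name `find_duplicate_pages` is illustrative and not part of the project described here.

```python
from collections import defaultdict


def find_duplicate_pages(submissions):
    """Group (page_id, text) pairs and return the pages submitted more than once."""
    by_page = defaultdict(list)
    for page_id, text in submissions:
        by_page[page_id].append(text)
    # Keep only the pages that received two or more transcriptions.
    return {page: texts for page, texts in by_page.items() if len(texts) > 1}
```

A reviewer could then compare the competing versions of a flagged page side by side before pasting one into the record.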

…And when the workflow breaks down

Q: What projects did you look to when you started on this? What lessons did you learn from some of those existing crowdsourcing projects?

We were inspired mostly by international and non-library efforts, but also by the New York Public Library’s What’s on the Menu? effort to transcribe historic menus. Grant-funded transcription projects that would lead to software development were just getting underway when we were looking around at models. Today, the Roy Rosenzweig Center for History and New Media offers a promising open-source tool called Scripto, which powers the CHNM project to transcribe 45,000 papers of the War Department.

Perhaps our favorite crowdsourcing pioneer is Zooniverse, a Citizen Science Alliance initiative to crowdsource all kinds of scientific, and now humanities, data. Citizens have been called upon to transcribe ship logs to study weather patterns at sea. The data helps researchers create climate model projections and track ship movements and the stories of those on board. Most recently they’ve taken images of Greek papyri fragments from the Sackler Library in Oxford and asked volunteers to measure them and transcribe the ancient lettering on the artifacts.

These early efforts made us realize that acknowledgement and rewards are an important part of keeping transcribers engaged. The Old Weather project allowed transcribers to move up in “rank” upon completing so many entries. But there were also ways in which we decided to do things differently. We deliberately chose to not require a transcriber login. Other projects, such as Zooniverse and the University College London’s Project Bentham, require users to create accounts. All we require is that text appear in the text box. We invite transcribers to provide a name and address, but it’s not mandatory. We didn’t think users would tolerate barriers to participation.

Rose Holley, of the National Library of Australia, wrote an article in D-Lib Magazine that examines some of the considerations when crowdsourcing and looks at some of the earlier efforts worldwide.

Q: From your talk, it sounds like the project is a resounding success. Could you tell us a bit about how the launch of the project went? What kind of response were you expecting and what kind of response did you get?


We weren’t sure what to expect when we launched the project on May 5, 2011. For the first month, web usage statistics were modest. But then on June 7, 2011, the American Historical Association posted the link on its blog.

On June 8, an email with the subject line “Increase bandwidth temporarily?” arrived in our departmental inbox. It was a warning that “Reddit.com, an enormous community (millions) of online folks that contribute the best content that others may find interesting, enlightening, funny, or all of the above” had found our site and planned to link to it. The note suggested we allocate more bandwidth to our server for this increased traffic for at least a day or two “or you may risk the site going down temporarily as thousands of interested person are accessing the site being unable to partake and contribute in a meaningful way.”

On June 9, 2011, we went from about 1,000 daily hits to our digital library on a really good day to more than 70,000. Our staff busily upped the RAM on the server, but the attention brought the whole digital library to its knees. The site went down, and transcription input stopped for staff, who could no longer access the digital library to add the huge batches of transcripts. Only a fraction of volunteer transcribers could access the site. Their transcriptions kept arriving in our email, but with the system down our staff could only watch the pile grow and grow. It was horrifying, and thrilling.

Those five minutes of fame were fun and came as a huge surprise. But what has proven more satisfying is the sustained interest by many users in the digital collections. We have developed a loyal cadre of transcriptionists, one of whom has transcribed more than 500 pages on her own.

Q: You talked a little bit about how the project got site visitors to interact with your content in different ways. People spent more time on the site and engaged with the content in more rich ways. Could you give us some examples of what you were talking about in that case?

The transcriptionists actually follow the story told in these manuscripts and often become invested in the story or motivated by the thought of furthering research by making these written texts accessible. One of our most engaged transcribers, a man from the north of England, has written us to say that the people in the diaries have become almost an extended part of his family. He gets caught up in their lives, and even mourns their deaths. He has enlisted one of his friends, who has a PhD in military history, to look for errors in the transcriptions already submitted. “You can do it when you want as long as you want, and you are, literally, making history,” he once wrote us. That kind of patron passion for a manuscript collection is a dream. Of the user feedback we’ve received, a few of my other favorites are:

This is one of the COOLEST and most historically interesting things I have seen since I first saw a dinosaur fossil and realized how big they actually were.

I got hooked and did about 20. It’s getting easier the longer I transcribe for him because I’m understanding his handwriting and syntax better.

Best thing ever. Will be my new guilty pleasure. That I don’t even need to feel that guilty about.

Q: Clearly one of the project’s goals was to create the transcriptions. However, the kind of traffic and attention the project brought in, and the extent to which it invited so much deeper public engagement with the collections, strike me as far more valuable. Having people make these kinds of personal connections with the collection seems to me to be far and away more valuable than the transcriptions. In my mind, the transcriptions seem like a nice bonus that came along with an unbelievably successful outreach and education effort. In hindsight, how do you evaluate the success of the project? What do you see as the most important implications of the project?

The connections we’ve made with users and their sustained interest in the collection are the most exciting and gratifying part for me. That’s not to diminish the importance of receiving thousands of “free” transcriptions. These volunteers have brought a level of access to the collection that would never have happened without them.

Another measure of success is what’s happening with donations. One woman who is the great-great-great granddaughter of a Civil War soldier from Iowa just donated several letters to us in response to publicity generated by the project. There have been other similar stories of people coming forward with letters and diaries.

One thing I’m interested in is how best to reach out to our ‘power volunteers,’ those who have done more than 100 entries each. We’ve recognized them on Twitter (@UIL_transcripts) and elsewhere. But what kind of feedback do they have for us? How can we best continue to cultivate and grow these kinds of relationships via the web? We are discussing how to scale this effort beyond the Civil War. The appeal of the Civil War is huge, and it helped us find an audience. What direction do we go next that will both engage the public and further the access goals of our institution? We’ve just started discussing possibilities, and the one that seems to be gaining some traction is a project to transcribe old cookbooks from the Szathmary Culinary Arts Archives. We also may pursue a campus partnership to crowdsource children’s diaries from the 1860s through the 1900s that reveal how young settlers 150 years ago recorded their lives.

Q: What advice would you give to someone who was considering a crowd-sourcing project in their library?

Take advantage of crowdsourcing software if you are able, but don’t be afraid to experiment with cloud technologies or other low-tech solutions. Take care of your volunteers. Recognize them and ask them for feedback. Also, don’t let concerns about the accuracy of transcriptions keep you from trying a project. As we piloted the system, some staff members voiced concerns about the project’s premise. Was the public even qualified to do the work? I really like how Sharon Leon, a historian at George Mason University, addressed that question in a New York Times article. “We’re not looking for perfect,” she said. “We’re looking for progressive improvement, which is a completely different goal from someone who is creating a letter-press edition.” It’s interesting that we’re now seeing at least one of the more knowledgeable volunteers coming along and offering corrections to earlier crowdsourced work–further evidence that you should never underestimate the crowd.

One Comment

PLEASE do the Culinary Arts project, my boss is an old food nut (she cooks food from centuries old recipes… or should I say receipts) and would love to work on this, as would I. How does one find out about these projects?

