Upload the file to your subdomain (20 pts)
Tutorial: see the tutorials on “Uploading files” and “File manager” here http://docs.umwdtlt.org/domains/ (click on the CPanel tab!!!!) Make sure you upload your files to the correct directory!!! That is public_html and then whatever folder has the name of your subdomain.

Make sure your page works by going to the link. Edit the file and upload the revised file if necessary.

Post the link to your new HTML page to the assignment on Canvas (this is how I will see if you did the above steps).

Note: if you are having major difficulties, submit your HTML page to Canvas, and I will give you some credit, and we can troubleshoot together after Thanksgiving.

Note: folks like Kyle and Ray who already know HTML: try to create a standalone stylesheet for your page using cascading style sheets

Our unit in Intro DH right now is on mapping. In class we’ll be working on creating maps with Palladio. We also had a preliminary introduction to data, tables, and maps by experimenting with Google Fusion Tables. In preparation for class, I imported a data set consisting of a list of images from the Cushman Archive into a few different tools to experiment.

Here is the map of the data in a Google Fusion map:

This is Miriam Posner’s version of the data. She downloaded the data from the Cushman archives site, restricted the dates slightly, and cleaned it up. This data went straight into Google’s Fusion Tables as is. The map shows the locations of the objects photographed. One dot for every photograph. Locations are longitude-latitude geocoordinates.

Then I tried CartoDB. I’ve never used it before, but it’s fairly user friendly for anyone willing to spend some time just playing around and seeing what works and doesn’t work. The first thing I discovered was that CartoDB (unlike Fusion Tables) does not like geocoordinates in one field. In the Cushman dataset, the longitude and latitude were together in one field. But in CartoDB, longitude and latitude must be disaggregated. So to create the following map in CartoDB I first followed the instructions in their FAQ to create separate columns for longitude and latitude. Then I had fun playing with their map options.
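For anyone who would rather do the split before uploading, here is a minimal sketch in Python of disaggregating a combined coordinate field. It is not the method from the CartoDB FAQ. The column name `geocoordinates`, the "latitude, longitude" ordering, and the sample row are all assumptions made for illustration; check how your own export orders the two numbers.

```python
import csv
import io


def split_coordinates(value, order="lat,lon"):
    """Split a combined coordinate string into (latitude, longitude) floats."""
    first, second = (float(part) for part in value.strip("() ").split(","))
    return (first, second) if order == "lat,lon" else (second, first)


def add_lat_lon_columns(rows, field, order="lat,lon"):
    """Return copies of the rows with separate latitude and longitude keys."""
    out = []
    for row in rows:
        new_row = dict(row)
        lat, lon = split_coordinates(row[field], order)
        new_row["latitude"] = lat
        new_row["longitude"] = lon
        out.append(new_row)
    return out


# Hypothetical one-row CSV standing in for the real dataset.
sample = io.StringIO('title,geocoordinates\nExample photograph,"41.88, -87.63"\n')
rows = add_lat_lon_columns(csv.DictReader(sample), "geocoordinates")
print(rows[0]["latitude"], rows[0]["longitude"])
```

If the field stores longitude first, pass `order="lon,lat"` and the function swaps the values accordingly.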

This is just a plain map, but with the locations color coded by the primary genre of each photograph (direct link to CartoDB map):

This one shows the photographs over time (go to the direct link to CartoDB map, because on the embedded map below, the legend blocks the slider):

Then I decided I wanted to see if I could map based on states or cities (for example, summing the number of photographs in a certain state, and color-coding or sizing the dots on the map based on the number of photographs from that city or state). So I used the same process to disaggregate cities and states as I used to disaggregate longitude/latitude — I just changed the field names. I noted, though, that for some reason, trying to geo-code by the city led to some incorrect locations. If you zoom out in the map below, you’ll see that some of the photographs of objects in Atlanta, Georgia, have been placed in Central Asia, in Georgia and Armenia. This map represents many efforts to clean the data through automation — simply retelling CartoDB to geocode the cities or states. Didn’t work well.

I also couldn’t figure out a good way to visualize density — the number of photographs from each state, for example. So I downloaded my new dataset from CartoDB as a csv file and then imported it into Tableau (Desktop 9.0). By dragging and dropping the “state” field onto the workspace, I quickly created a map showing all the states where photographs in the collection had been taken:

Then I dragged and dropped Topical Subject Heading 1 (under the Dimensions list on the left in Tableau) onto my map, and I dragged and dropped the “Number of Records” Measure (under the Measures list on the left in Tableau), and I got a series of maps, one for each of the subjects listed in the TSH1 field:

Note that Tableau kindly tells you how many entries it was unable to map! (the ## unknown in the lower right).

Below I’ve Summed by the number of records (no genre, topical subject, etc.) for each state. For this, it’s better to use the graded color option than the stepped color option. If you have just five steps or stages of color, it looks like most of the states have the same number of images, when it is more varied. The graded color (used below) shows the variations better.
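The stepped-versus-graded point can be illustrated outside Tableau. The sketch below counts records per state and then assigns each count to one of five equal-width color steps. The counts are invented for illustration, and the equal-width binning is an assumption; Tableau's own stepping logic may differ.

```python
from collections import Counter


def count_by_state(rows, field="state"):
    """Count records per state, skipping rows where the field is empty."""
    return Counter(row[field] for row in rows if row.get(field))


def stepped(value, lowest, highest, steps=5):
    """Return the 0-based color step (equal-width bins) a value falls into."""
    if highest == lowest:
        return 0
    position = (value - lowest) / (highest - lowest)
    return min(int(position * steps), steps - 1)


# Invented counts: three small states and one very large one.
counts = {"State A": 2, "State B": 15, "State C": 30, "State D": 400}
low, high = min(counts.values()), max(counts.values())
for state, n in counts.items():
    print(state, n, "step", stepped(n, low, high))
```

With one very large value in the data, the three smaller states all land in the same step even though their counts differ by a factor of fifteen. A continuous scale keeps those differences visible.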

This map also shows that the location information for photographs from Mexico was not interpreted properly by Tableau. Sonora (for which there is data) is not highlighted.

Then I decided hey, why not a bubble map of locations, so here we go. Same data as above map, but I selected a different kind of visualization (called “Packed Bubbles” in Tableau).

When I hovered on some of the bubbles, I could easily see the messy data in Tableau. Ciudad Juarez is one of the cities/states that got mangled during import, probably due to the accent:

Finally, a simple map with circles corresponding to the number of photographs from that location. (Again clearly showing that the info from Mexico is not visible. In fact, 348 items seem not to be mapped.)

Obviously the next step would be to clean the data, using Google Refine, probably, and then reload.
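On the mangled accent in Ciudad Juarez noted above: one common cause of this kind of damage is a file written in one character encoding and read in another. I have not confirmed that this is what happened in Tableau, so treat the sketch below as an illustration of the general problem only. The choice of UTF-8 and Latin-1 is an assumption.

```python
def mangle(text, written_as="utf-8", read_as="latin-1"):
    """Simulate reading text with the wrong encoding."""
    return text.encode(written_as).decode(read_as)


def repair(text, written_as="utf-8", read_as="latin-1"):
    """Reverse the mistake, assuming we guessed both encodings correctly."""
    return text.encode(read_as).decode(written_as)


original = "Ciudad Juárez"
garbled = mangle(original)
print(garbled)          # the accented letter turns into two stray characters
print(repair(garbled))  # back to the original spelling
```

Checking the encoding setting when exporting and importing the csv file is usually easier than repairing the text afterward.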

In my Introduction to Digital Humanities course, my students are conducting very basic text analysis using Voyant and AntConc. One of the datasets we are using is a set of martyr texts taken from the now public domain Ante-Nicene Fathers series (available at newadvent.org).

I’m a little bit of a skeptic regarding wordclouds; I generally regard them as useful insofar as they are aesthetically pleasing and in that they may spark a deeper interest in a text or set of texts.

Thus, I was pleasantly surprised to see the results of the wordcloud in Voyant. A martyr is a witness, quite literally in Greek. And lo and behold: the most prominent word (after accounting for a standard English stop word list) is “said.” Speaking. Witnessing?

We also put the martyr texts through AntConc, and we tested the Martyrdom of Perpetua and Felicitas against the rest of the dataset to check for key words: just which words were distinctive to Perpetua and Felicitas? Once again I was pleasantly surprised.

AntConc: Keywords in Perpetua and Felicitas measured against other martyr texts in English translation

Note the prominence of “I” and “my” and “me.” The “keyness” of the first person pronouns reflects the presence of a section of the martyr text often called Perpetua’s “prison diary”; according to tradition, the diary was written by Perpetua herself. The keyness of “she” and “her” of course reflects the text’s women protagonists.

In the new window, you’ll see places to add a Title and a slug (the slug is what gets added to the URL so pick one word or maybe two separated by a hyphen, no spaces) (see below)

Type title, slug, and select layout for your page

You’ll also select a layout for content. (above) You probably want either the Gallery or the File with Text. (File with Text allows you to include more than one Item file — no worries). You CAN have more than one content block, so you can choose Gallery first, then File with Text next, or vice versa. If you are confused, just choose File with Text.

Click “Add new content block” after you’ve selected the layout.

Then you’ll see a window to add Items and type in text below. You can add more than one Item (just keep clicking Add Item). Give your item a good caption! For the narrative text: SEE THE HANDOUT on what to write.

Add content to your page

Add a second content block if you want. (Say, for example, you want to have just one image + text and then a Gallery below.)

Add another content block (optional)

Don’t forget to save your changes! Often!

Don’t forget to save your changes!

To get back to this page: login > Exhibits > Edit [your exhibit] > scroll down to see your page and click on it (see below)

To edit your page, go back to the page where you edit the Exhibit and scroll down.

This is the website where we will be creating an online exhibit related to the text you are reading for tomorrow.

Please:

Click through the link to create a password for the site. (Remember your password)

Login to the site

Click on your own name in the upper right (next to “Welcome”) and update your profile: change your user name, public name, and/or email to whatever you want them to be. NOTE: your public name is the name that will appear online associated with content you add to our exhibit. If you don’t want your real name online, change your public name.

Use the collocation feature to see what words are used around m?n and wom?n in the Shakespeare or martyr corpus. What does this tell us about what these texts associate with men vs women? What does this tell us about how men and women are depicted in the corpus? (Experiment with the From… To… settings in the lower right to give yourself a small window around m?n and wom?n and then a larger window to catch more words.)

Use the collocation feature to see how Shakespeare talks about love and death (something we did in Voyant, also). Do you need to use wildcards (such as love* or death*|die*|dead|dying)? Do you get more information than you did in Voyant? Less? Or a different kind? What does it tell you about how Shakespeare talks about love and death? [If love and death don’t interest you, play around with any other key words of interest to you.]
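If you are curious what a collocation search is doing under the hood, here is a rough sketch in Python. It is not AntConc's implementation: it reports raw counts of neighboring words rather than a statistical score, and the wildcard handling (? for any one character, * for zero or more characters, | for alternatives) is my simplified reading of the wildcards used in the questions above. The sample sentence is invented.

```python
import re
from collections import Counter


def wildcard_to_regex(pattern):
    """Translate a simple wildcard pattern into a compiled regular expression."""
    parts = []
    for alternative in pattern.split("|"):
        escaped = re.escape(alternative)
        escaped = escaped.replace(r"\?", ".").replace(r"\*", r"\w*")
        parts.append(escaped)
    return re.compile(r"^(?:" + "|".join(parts) + r")$", re.IGNORECASE)


def collocates(text, pattern, left=5, right=5):
    """Count the words within `left`/`right` positions of each matching word."""
    tokens = re.findall(r"\w+", text.lower())
    node = wildcard_to_regex(pattern)
    counts = Counter()
    for i, token in enumerate(tokens):
        if node.match(token):
            window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
            counts.update(window)
    return counts


sample = "The man spoke and the woman answered. The men left; the women stayed."
print(collocates(sample, "m?n", left=1, right=1).most_common())
print(collocates(sample, "wom?n", left=1, right=1).most_common())
```

Changing `left` and `right` corresponds to widening or narrowing the From… To… window: a small window catches immediate neighbors, a larger one catches more words but also more noise.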

Note: you can click the CLONE RESULTS button to have your results appear in a separate window. You can take screenshots of anything you think is important, also, and print the screenshots. [Don’t worry if you don’t know how to take a screen shot.]

TIP: If you want to look at different corpora and use different tools within AntConc, it might be useful to clear your data out before uploading a new corpus. The “File” menu up on the top menu bar has an option for Clear Tool, Clear All Tools, and Clear All Tools and Files. I often “Clear All Tools and Files” before doing something new.

FOR CLASS THURSDAY: Bring a typed response to these questions. We will discuss them further in class, and in small groups will post to the class blog. You will turn in your typed response in class.

You do not need to spend hours playing around with AntConc. Load the corpora, use the collocation feature, see what you learn and don’t learn.

OPTIONAL [don’t worry if you don’t do this]

Using the martyr corpus, set one text or set of texts as a target in the key words tool and the rest of your corpus as the reference corpus. See the screen capture of the tutorial for more details and see the tutorial online, linked on the syllabus. (For example: load the Martyrdom of Perpetua and Felicitas. Then in Settings > Tool Preferences > Key Word List load the whole martyr corpus as the Reference corpus. Then click Apply, and in the Key Word List part of AntConc click “Start.” You should get a list of words that are more distinctive of Perpetua and Felicitas compared to the whole corpus. What do these Key Words tell you about the Martyrdom of Perpetua and Felicitas compared to the other martyr texts?)
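For the curious, here is a small sketch of how a keyness score can be computed. It uses the log-likelihood statistic, one commonly used keyness measure; whether it matches the setting selected in your copy of AntConc is an assumption, and the tokenizer and the two tiny sample "texts" are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def log_likelihood(a, b, c, d):
    """a, b: word frequency in target and reference; c, d: sizes of each corpus."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the target
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(target_text, reference_text, top=10):
    """Rank words that are relatively more frequent in the target text."""
    target = Counter(tokenize(target_text))
    reference = Counter(tokenize(reference_text))
    c, d = sum(target.values()), sum(reference.values())
    scored = []
    for word, a in target.items():
        b = reference[word]
        if a / c > b / d:  # keep only words over-represented in the target
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]


target = "I said my dream to my father I said"
reference = "he said she said they said the judge said"
for word, score in keywords(target, reference, top=3):
    print(word, round(score, 2))
```

A word that is frequent everywhere, like "said" in this toy example, does not come out as a key word, because it is not more frequent in the target than in the reference.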

Go to the Dashboard (hover over upper left to see Dashboard in a drop down menu)

Go to Your Profile

CHANGE your nickname to something we can all understand, and set your “Display name publicly” as your Nickname.

Be sure to save your changes!

Create a test blog post. [Click on the + New link on the top bar of the screen, or click on Posts > Add New on the left menu]. Here is a video about writing a page or a blog post in WordPress. (YOU ARE MAKING POSTS, not pages. The beginning is also not relevant; pick up the video at about 58 seconds in, when it shows you the dashboard.)

I spent two weeks at DHSI this year. In week 2 I took Liz Losh’s and Jacque Wernimont’s Feminist DH, which was incredible and which I highly recommend to everyone. Check out the #femdh stream on Twitter for details.

During week 3 of DHSI this year, I took Neal Audenaert’s Topic Modeling, in which we were introduced to using R and then using Mallet in R (following Matt Jockers’ book). I decided to try to topic model the English Revised Standard Version of the Bible, because: 1) I know the material, 2) it was easy to scrape.

I used 1000 character chunks (except for the teeny tiny books like Philemon and some other epistles). And I chose 20 topics (which was too few, but hey, this was my first time out), and Jockers’ stop list (which Neal gave us and I’m guessing is online somewhere). First thing I noticed (besides needing more topics) was that the stop list needs to be expanded. Topic #13 below is basically junk, because of “thee”, “thy”, etc. Thanks to Neal for the help this week!
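The work described here was done in R with Mallet. Purely as an illustration of the two preprocessing ideas (fixed-size chunks and an expanded stop list), here is a sketch in Python. The splitting is by raw character count, so it can cut a word in half; how the original chunking treated word boundaries is not something I am reconstructing. The base stop list is a placeholder, not Jockers’ list.

```python
def chunk_text(text, size=1000):
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[start:start + size] for start in range(0, len(text), size)]


def remove_stop_words(words, stop_words):
    """Drop any word that appears in the stop list (case-insensitive)."""
    return [word for word in words if word.lower() not in stop_words]


# Placeholder base list; the real stop list would be loaded from a file.
base_stop_words = {"the", "and", "of"}
# Archaic pronouns that swamped one topic; extend after inspecting the output.
expanded_stop_words = base_stop_words | {"thee", "thy"}

chunks = chunk_text("x" * 2500, size=1000)
print([len(chunk) for chunk in chunks])
print(remove_stop_words("Thy kingdom come and thy will be done".split(), expanded_stop_words))
```

Very short books come out as a single chunk shorter than the target size, which is one way to handle the tiny epistles.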

Here are wordclouds of the top 100 words in each topic. Some make a lot of sense.

This fall, as I have been trying to finish up my book project, Monks and Their Children, I have been asked more than once: What’s your next project? When I start describing copticscriptorium.org, I frequently get the reply: no, I mean your real project, your next book. My internal response was always twofold: the snarky, “What, bringing the study of an entire language into the 21st century is not enough?” and the desperate, “I am not sure I have another monograph in me.” And as the fall wore on, and 2014 became 2015, I became more and more convinced of the authenticity of those sentiments: that digital scholarship in early Christian studies and late antiquity is still not regarded as legitimate as print monographs and articles, and that indeed I had no interest in writing another monograph. It’s not that I thought I couldn’t write another book, but that I just had no desire to spend another decade on a long-form argument. I was more interested in digital writing and digital scholarship that could be read or used by a community more quickly. And in tighter, more focused arguments in essay form.

I also began chafing more and more at the conservatism of the field. The definitions of “real” scholarship, the structural sexism that colleagues like Ellen Muehlberger and Kelly Baker were documenting in academia, and the perception of Egypt and Coptic as marginal areas of study. That conservatism stoked my rebellious fires further; I was not going to force myself to come up with a book project just because that was what one “did” as an active scholar.

And then I saw the CFP for the Debates in the Digital Humanities Series. It’s a call for essays, not monographs, but like Augustine hearing the child chant, “Tolle lege,” I had an epiphany: I damn well had a third book in me. I just hadn’t put the pieces together.

In fact, I have two projects in mind: both are examinations of the field of early Christianity as it intersects (or does not) with Digital Humanities. Both are political and historiographical.

The book (as yet untitled) is about early Christian studies (especially Coptic and other “Eastern” traditions and manuscript collections), cultural heritage, and digitization. Planned chapters are:

Digitizing the Dead and Dismembered. About the material legacy of the colonial dismemberment of archives, the limitations of existing DH standards and technologies (e.g., the TEI, the Unicode character set, etc.) to account for these archives, and how these standards, technologies, and practices must transform. The Coptic language and the White Monastery/Monastery of Shenoute manuscript repository will be the primary source examples, but there should be other examples from Syriac and Arabic.

Can the Colonial Archive Speak? Orientalist Nostalgia, Technological Utopianism, and the Limits of the Digital. This chapter will look at the practice of constructing digital editions and digital libraries and (building on the issues discussed in the previous chapter) explore the premise that digitization can “recover” an original dismembered archive such as the White Monastery’s repository. To what extent can digitization recover and reconstruct lost libraries? What are the political and ethical obligations of Western libraries to digitize manuscripts from Egypt and the wider Middle East? Does digitization transcend or reify colonial archaeological and archival practices? This chapter focuses on the concepts of the archive and library and voice. [HT to Andrew Jacobs for inspiring the chapter title.]

Ownership, Open Access, and Orientalism. About the benefits, consequences, and dangers of the open access paradigm for digitizing eastern Christian manuscript collections. Will look at the history of theft of physical text objects from monasteries by Western scholars and will ask whether open access digitization is cultural repatriation or digital colonization. Will look at a number of complexities: a) the layers and levels of digitization (metadata, text, images); b) the spectrum of openness and privacy possible; and c) the different constituencies involved in asking the question: whose heritage is this? who owns/owned the text? Church, local monastery, “the world” (as world heritage), American/European scholars who have privileged access to some of these texts already in their libraries or on their computers. Will explicitly draw on insights from indigenous cultural heritage studies related to digitization and digital repatriation.

Transparency and Overexposure: Digital Media and Online Scholarship in Debates about Artifact Provenance. This chapter will examine the extent to which blogs and social media have changed the conversation about the provenance of text-bearing objects we study, and the ethical responsibilities of researchers. Will also look at the risks of online debates, and suggest ways to have constructive conversations moving forward. With special attention to the intersections of status (who’s online and who’s not?) and gender.

The Digital Humanities as Cultural Capital: Implications for Biblical and Religious Studies. Why our field needs to stop treating digital scholarship as derivative or less rigorous, the implications for us being so conservative about digital scholarship as a field, and how Biblical and Religious Studies can contribute to DH as a discipline (not just in content but in concept, in theory, in its very understanding of itself as a discipline or field, in other words, why DH needs Biblical and Religious studies).

Desirable but maybe a stretch: War and the Western Savior Complex: Looks at the rhetoric of crisis and loss (especially in the context of the early 21st c. wars and revolutions in the Middle East) around saving texts, artifacts, and traditions. What does it mean for scholars from Europe and America who are not the policy makers in their countries but are nonetheless citizens of them to be making pleas for the preservation of antiquities and/or cultural traditions (and there is a conflation of ancient traditions and modern Eastern Christian peoples in scholarship and the media; see Johnson’s JAAR article “‘He Made the Dry Bones Live’”) that are endangered in part because of the actions of our governments?

The other project will be digital historiography: using digital and computational methods to crunch Journal of Early Christian Studies (and hopefully its precursor the Second Century?) to look at trends in the field, especially with respect to gender. Who is publishing, what are we publishing on? Who is citing whom? Who is reviewing whom? How has that changed (or not) over the decades? This may be one or two essays, not a book. And it is inspired in part by Ellen Muehlberger’s work micro-blogging statistics on gender in biblical studies book reviews. I’m taking the Topic Modeling course at DHSI this summer and will think more about how that or other methods (concordance text analysis, network analysis, etc.) will support this project.

I hope to publish all of this in digital form, including the monograph on cultural heritage and cultural capital.

So that’s my digital future. Of course, first I need to get a couple of other things out the door. And of course Coptic Scriptorium continues. But when you ask me what my next book is about, there you go.

Coming very soon (hopefully in the next week):
blog posts about the texts, tools, and data models
updated TEI XML files of the corpora

Later in 2014 and early 2015:
Shenoute’s Not Because A Fox Barks
More Biblical material, including Greek alignment with the SBL Greek New Testament

Please let us know if you’re interested in contributing to the project! Have Coptic texts you’d like to put in the ANNIS search and visualization tool? Want to annotate any of the documents for biblical references, or something else? Reply to this post or email C. Schroeder carrie [at] carrieschroeder [dot] com

Also: Let us know if you find any errors. We’ll show you how to fork us on GitHub and edit our data!

Schroeder, Caroline T.
University of the Pacific
carrie@carrieschroeder.com

Zeldes, Amir
Humboldt University
amir.zeldes@rz.hu-berlin.de

Abstract

This paper will explain the unique challenges to processing and annotating the Coptic language for digital research and present solutions and methodologies for digital linguistic and historical scholarship.

The Coptic language evolved from the language of the hieroglyphs of the pharaonic era and represents the last phase of the Egyptian language. It is pivotal for a wide range of humanistic disciplines, such as linguistics, biblical studies, the history of Christianity, Egyptology, and ancient history. Whereas languages like Classical Greek and Latin have enjoyed advances made in digital humanities with fully-fledged online research environments accessible to students and scholars (such as the Perseus Digital Library), until recently, no computational tools for Coptic have existed. Nor has an open digital research corpus been available. The research team developing Coptic SCRIPTORIUM (Sahidic Corpus Research: Internet Platform for Interdisciplinary multilayer Methods) is developing and providing open-source technologies and methodologies for interdisciplinary research across multiple disciplines in the Coptic language. This paper will address the automated tools we are developing for annotating and conducting research on a Coptic digital corpus.

Conducting digitally-assisted and computational research in Coptic using available DH resources is complex for several reasons. Most texts are preserved from damaged, incomplete, and dismembered manuscripts or papyri. The DH project papyri.info has begun to create an online open-access resource for the study of Greek papyri and is beginning to digitize Coptic papyri and ostraca (ancient pot-shards with writing). These texts, however, are primarily documentary, consisting of wills, contracts, personal letters, etc. Coptic literary and monastic texts, the core of Coptic SCRIPTORIUM, are essential for the study of the Bible, intellectual history, literary history, and religious history. The manuscripts containing these texts were removed from Egypt in the seventeenth through nineteenth centuries piece by piece (sometimes page by page). Some have been published, many have not, and very few have been digitized in a format suitable for digital and computational work. Texts must be reconstructed from pieces of manuscripts published in fragments and/or stored in various libraries and museums worldwide. The status of Coptic literary and monastic texts complicates metadata management and corpus architecture: what constitutes a “work”: the codex in which a copy of the text appeared (and which may be dispersed across multiple physical repositories)? The manuscript fragment housed in a particular library or museum repository? Or the work itself, which might survive only in fragments of multiple codices (all copies of a “book” from the monastery’s library), and thus in fragments not only from more than one codex but also from more than one modern repository?

Coptic scholarship still lacks many standards for digital publication and language research that are taken for granted in Greek and Latin. As with other ancient languages, Coptic manuscripts are written without spaces. However, in contrast to its ancient counterparts, scholarly conventions on word division differ substantially from scholar to scholar. Additionally, since Coptic is an agglutinative language, the relevant unit for linguistic analysis is the morpheme, below the ‘word’ level. This means that segmentation guidelines must be developed for both levels of resolution. In order to search multiple texts, guidelines and tools for normalization, part-of-speech tagging and lemmatization of Coptic must be developed. These tools need to take into account Coptic’s agglutinative nature, e.g. normalizing and annotating on the morpheme and word levels.

Finally, the development of the Coptic language during Egypt’s Greco-Roman era raises questions about the origins of the language, its usage in a multilingual context, and the language practices of its ancient speakers and writers. Coptic consists of Egyptian grammar, vocabulary, and syntax written primarily in the Greek alphabet; some Egyptian letters were retained, and some Greek and Latin vocabulary was incorporated into the language. The richness of the vocabulary’s languages of origin varies from author to author, genre to genre. And despite recent publications on the topic, much research remains to be conducted on the extent and nature of multilingualism in late antique Egypt, especially during the fourth and fifth centuries. Additionally, due to the agglutinative nature of the language, one word can be comprised of morphemes with different languages of origin.

This paper will focus on the automated tools our project is developing to process the language, especially tokenizing and part-of-speech annotations. Coptic SCRIPTORIUM has developed the first tokenizer and part-of-speech tagger for the language, and in fact for any language in the Egyptian language family. The presentation will address the unique challenges to processing and annotating the Coptic language. We will present our current technical solutions, their accuracy rates, and the potential for future research. We will also address the ways in which this language’s and corpus’s unique features differentiate them from other more widely studied ancient languages, such as Greek and Latin. Examples will be drawn from the open-access corpora we are developing and annotating with these tools, available at http://coptic.pacific.edu (backup site http://www.carrieschroeder.com/scriptorium). The Coptic corpora processed and annotated with these tools can be searched and visualized in ANNIS, a tool for multi-layer annotated corpora. We anticipate this presentation to be of interest to scholars in digital humanities working with ancient languages and manuscript corpora as well as DH linguists and corpus linguists.

Many thanks to the people and institutions who made this possible: my collaborator Dr. Amir Zeldes; Tito Orlandi, Stephen Emmel, Janet Timbie, and Rebecca Krawiec who have freely given their labor and their advice; and funding agencies of the National Endowment for the Humanities and the German Federal Ministry of Education and Research. I’m also pleased that this year Rebecca Krawiec along with Christine Luckritz Marquis and Elizabeth Platte will be helping to expand our corpus of digitized texts.

Two years ago, I gave a NAPS paper entitled, “Shenoute of Atripe on the Digital Frontier,” in which I explored, and despaired of, challenges to digital scholarship in early Christian studies, especially Coptic. I posed important questions, such as, “Why don’t my web pages make the supralinear strokes in Coptic appear properly?” and “Why do I only have 50 followers on Twitter?”

I am pleased to report that I have in fact solved both of those problems. And today I want to take you on a tour into the weeds of digital Coptic: how to create a data model for computational research in Coptic; the requirements for visualizing and searching the data; and what you can do with this all once you’ve got it.

Today I’m going to get technical, because over these past two years I’ve come to learn two things:

1. Digital scholarship is about community—the community that creates the data, contributes to the development of standards for the creation of that data, and conducts research using the data. In other words, my work won’t succeed if I don’t drag all of you along with me.

2. The truth is not in there (i.e., you might be thinking, “What is she doing talking about ‘data’? She did her Ph.D. at Duke in the 1990s?!”). In case you came to this paper wondering if I’ve abandoned my Duke Ph.D. post-modern, Foucauldian, patented Liz Clark student cred for some kind of positivist, quantitative stealth takeover of the humanities, well HAVE NO FEAR. The true truth is not in some essentialized compilation of “the data.” As with traditional scholarship, our research questions determine how we create our dataset, and how we curate and annotate it is already an act of interpretation. (I owe Anke Lüdeling for helping me think through this issue.)

So please, take my hand and take the red pill, not the blue pill, and jump into the data.

Our project is called Coptic Scriptorium, and in a nutshell, it is designed as an interdisciplinary, digital research environment for the study of Coptic language and literature. We are creating technologies to process the language, a richly annotated database of texts formatted in part with these technologies, texts to read online or download, documentation, and ultimately a collaborative platform where scholars and students will be able to study, contribute, and annotate texts. It is open source and open access (mostly CC-BY, meaning that you can download, reuse, remix, edit, research, and publish the material freely as long as you credit the project).

We also invite any of you to collaborate with us. Consider this presentation an open invitation. Our test case was a letter of Shenoute entitled Abraham Our Father, and we’ve since expanded to include another unnamed text by Shenoute (known as Acephalous Work 22 or hereafter A22), some Sahidic Sayings of the Desert Fathers, two letters of Besa, and a few chapters of the Gospel of Mark.

I’ve entitled this paper, “Tagging Shenoute” for two reasons. First, “tagging” refers to the process of annotating a text. To conduct any kind of search or computational work on a corpus of documents, you need to mark them up with annotations, sometimes called “tags.” They might be as simple as tagging an entire document as being authored by Shenoute, or as complex as tagging every word for its part of speech (noun, verb, article, etc.) or its lemma (the dictionary headword for words that have multiple word forms). Second, because the pun with the child’s game of tag was too rich to pass up. The Abba himself disdained children’s play and admonished the caretakers of children in his monastery not to goof around:

“As for some people who have children who were entrusted to their care, if it is of no concern to them that they live self-indulgently, joking with them, and sporting with them, they will be removed from this task. For they are not fit to be entrusted with children. It is in this way also with women who have girls given to them.”

And finally, I was inspired by a conversation between two senior Coptic linguists at the Rome 2012 Congress for the International Association of Coptic Studies. When I told them about our nascent project, one replied something along the lines of, “I would not dare to think that Shenoute would allow himself to be tagged!” And the riposte from the other: “And I would not presume to speak for Shenoute!” All of this is to suggest, subversively, that despite Shenoute’s own words, he can be fun. While annotation is serious work, there is also an element of play: playing with the data, and pleasure in the text.

The premise of our project is to facilitate interdisciplinary research, to develop a digital environment that will be of use to philologists, historians, linguists, biblical scholars, even paleographers. To that end, we have dared to tag Shenoute in quite a variety of ways:

Metadata: information about text, author, dating, history of the manuscript, etc.

Manuscript or document structure: page breaks, column breaks, line breaks, damage to the manuscript, different ink colors used, text written as superscript or subscript, text written in a different hand….

With hopefully more to come: biblical citations, citations and quotations of other authors, named entities with data linked to other open source projects on antiquity, source language for texts in Coptic translation (e.g., Apophthegmata Patrum and Bible).

We also must be cautious and discerning, on the lookout for the demon of all things shiny and new. As Hugh Cayless writes on the blog for the prosopographical project SNAPDRGN,

“In any digital project there is always a temptation to plan for and build things that you think you may need later, or that might be nice to have, or that might help address questions that you don’t want to answer now, but might in the future. This temptation is almost always to be fought against. This is hard.”

Digital scholarship in Coptic must develop annotation standards in conversation with existing conventions in traditional, print scholarship, as well as digital standards used by similar projects on the ancient world and ancient texts. For Shenoute, this means using as titles of texts the incipits delineated by Stephen Emmel in his book Shenoute’s Literary Corpus, manuscript sigla developed by Tito Orlandi and the Corpus dei Manoscritti Copti Letterari as well as id numbers for manuscripts established by the online portal Trismegistos, and part-of-speech tags based on Bentley Layton’s Coptic Grammar.

In the digital world, in addition to Trismegistos, the emerging standard for encoding manuscript information for ancient papyri, inscriptions, and manuscripts is the subset of the Text Encoding Initiative’s XML tagset known as EpiDoc. The Text Encoding Initiative is a global consortium of scholars who have established annotation standards (including a comprehensive set of tags) for marking up text for machine readability. XML stands for Extensible Markup Language, and is used more widely in computer science, including in commercial software. EpiDoc is a subset of TEI annotations used especially by people working in epigraphy or on ancient manuscripts in a variety of languages. Patristics scholars might be familiar with it because the papyrological portal papyri.info uses EpiDoc markup to annotate its digital corpus.

So, this is a lot of information – what does the data actually look like? Coptic poses some unique challenges.

To get from base text to annotated corpora, there are a lot of steps: basic digitization of the text, encoding the manuscript information, ensuring that the Coptic word forms (bound groups) are properly formed, separating those bound groups into morphemes, normalizing the spelling so you can do genuine searching, and tagging for various annotations. I’m going to briefly go through most of these issues.

Before you can even begin to think about tagging, the data must be in a digital format that can be used and searched: typed in Unicode (UTF-8) characters and in recognizable word forms. Many of us in this room probably have various files on our computers with text we keyed into Microsoft Word (or dare I say, WordPerfect?) in legacy fonts. We have developed converters ourselves for a couple of different legacy fonts. But keying in the text is only one piece of the puzzle; users of the data must be able to see the characters on their computers, ideally even if they don’t have a Coptic font or keyboard installed, or on their mobile devices.
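To give a sense of what a legacy-font converter involves, here is a minimal Python sketch. It assumes a legacy font that stores Coptic letters at Latin keyboard positions. The mapping table and the function name are hypothetical and cover only a few letters; this is an illustration, not the project's actual converter.

```python
# Hypothetical mapping from legacy-font keystrokes to Unicode Coptic letters.
# A real legacy font needs its own complete table; these entries are illustrative.
LEGACY_TO_UNICODE = {
    "a": "\u2c81",  # Coptic small letter corresponding to alpha
    "e": "\u2c89",  # ... epsilon
    "n": "\u2c9b",  # ... nu
    "o": "\u2c9f",  # ... omicron
    "t": "\u2ca7",  # ... tau
    "u": "\u2ca9",  # ... upsilon
}


def convert_legacy(text, table=LEGACY_TO_UNICODE):
    """Convert legacy-font text to Unicode.

    Returns (converted_text, unmapped_characters) so that gaps in the
    mapping table are reported instead of silently passed through.
    """
    converted = []
    unmapped = set()
    for ch in text:
        if ch in table:
            converted.append(table[ch])
        else:
            converted.append(ch)  # keep the character, but report it
            if not ch.isspace():
                unmapped.add(ch)
    return "".join(converted), unmapped


text, unmapped = convert_legacy("noute")
utf8_bytes = text.encode("utf-8")  # what would be written to a UTF-8 file
print(len(text), len(utf8_bytes), sorted(unmapped))
```

Reporting the unmapped characters is a design choice in this sketch: legacy fonts often hide letters behind punctuation keys, so a converter that quietly leaves them alone would produce text that looks converted but is not.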

So we created an embedded webfont that is installed on our website and inside our search and visualization tool. We’ve even embedded a little Coptic keyboard into the search tool, so that you can key in Coptic characters yourself if your device isn’t capable.

Those of you who have studied Coptic know that it is different from Greek or Latin, in that it is an agglutinative language.

Multiple different morphemes, each with different parts of speech, are plugged together like Legos to create “words,” or as Layton describes them, “bound groups.” When you search Coptic, you might not want to search bound groups but rather the individual morphemes within them. That means, when you digitize the text, you need to be attentive to the morphemes and word segmentation. This process of breaking a text into its constituent parts is called “tokenization”; the token is the smallest possible piece of data you annotate. In English texts, it’s often a word.

There are two problems with Coptic.

– First, the concept of words is complex in Coptic

– Second, annotations overlap parts of words. For example, in a manuscript a line might break in the middle of a word.

Here are some examples. When we say Coptic is agglutinative, we mean that what we might think of as “words” are really bound groups of morphemes, as seen here in these two examples.

We’ve color-coded each separate morpheme, so that you can see that each one of these examples is a combination of seven or eight components. To complicate matters, scholars use different conventions to bind these morphemes into words in print editions. We follow Layton’s guidelines for visualizing or rendering Coptic bound groups.

But we also need not only to see or visualize words as bound groups but also to automate the taking apart of these bound groups. Our tools cannot yet handle text as you might see it in a manuscript, with scriptio continua. But, we have automated segmenting bound groups into morphemes, thanks in part to a lexicon that Tito Orlandi graciously gave us, which sped up our work by about a year.
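As a rough illustration of lexicon-driven segmentation, the Python sketch below splits a bound group into morphemes by trying the longest lexicon entry first and backtracking when a choice leads to a dead end. The tiny lexicon is in Latin transliteration and is hypothetical, and the longest-match strategy is an assumption made for the example; it is not a description of our actual segmenter or of the lexicon Tito Orlandi provided.

```python
from functools import lru_cache

# Toy, hypothetical lexicon in Latin transliteration.
LEXICON = frozenset({"hitm", "hi", "p", "noute", "peja", "f"})


def segment(bound_group, lexicon=LEXICON):
    """Split a bound group into lexicon entries.

    Longer morphemes are tried first; if a choice cannot be completed,
    the search backtracks. Returns a list of morphemes, or None when
    the bound group cannot be fully covered by the lexicon.
    """
    max_len = max(len(m) for m in lexicon)

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(bound_group):
            return ()
        longest = min(max_len, len(bound_group) - start)
        for size in range(longest, 0, -1):
            piece = bound_group[start:start + size]
            if piece in lexicon:
                rest = solve(start + size)
                if rest is not None:
                    return (piece,) + rest
        return None

    result = solve(0)
    return list(result) if result is not None else None


print(segment("hitmpnoute"))
print(segment("pejaf"))
print(segment("xyz"))
```

A real segmenter also has to choose between competing analyses that are all valid according to the lexicon, which this sketch does not attempt.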

But we need to dig deeper into our data than morphemes, because we might need to annotate on a level that’s even smaller than the morpheme. If you want to mark up the structure of the manuscript – the line breaks, oversized letters, letters written in different ink colors, etc., you need to annotate on the level of parts of morphemes or individual letters.

As in this example, where things that appear in the middle of a morpheme (such as the oversized janja in the middle of the word pejaf) might need to be tagged – size, line break, etc. So you need to annotate on a more granular level than “words” or “morphemes.”

So, now we’ve already got a ton of different ways to tag our data, and we’re not done yet. There are lots of other tags or annotations that you might want to make and use for research. What you do NOT want to have to do is to write this all up manually using actual XML tags in what is called inline markup.

Instead, if you mark up your data in multiple layers, or what is known as multi-layer standoff markup, you can make more sense of it and tag your data much more easily.

Here you can see the smallest level of data, the token layer, at the top. The second layer shows the morpheme segments, aligned with the tokens, except that the two tokens at the end are merged into one, because they form one term – Abraham. Line three gives you the bound groups, and line four shows you line breaks. Here you see the line ends in the middle of Abraham. Line five shows column breaks, and line six page breaks.
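For readers who think in code, the layered arrangement can be sketched in Python as follows. Each layer is a list of spans over token positions, so layers can overlap freely without nested tags. The transliterated tokens, layer names, and helper functions are hypothetical and chosen only to mirror the Abraham example; this is not the internal format of our tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # index of the first token covered (inclusive)
    end: int    # index one past the last token covered (exclusive)
    value: str


# Smallest units. "abra" / "ham" are separate tokens because a line
# break falls in the middle of the name (hypothetical transliteration).
tokens = ["n", "abra", "ham"]

layers = {
    "morph": [Span(0, 1, "n"), Span(1, 3, "abraham")],
    "bound_group": [Span(0, 3, "nabraham")],
    "line": [Span(0, 2, "1"), Span(2, 3, "2")],
}


def annotations_at(index, layers):
    """Collect the value of every layer that covers the token at index."""
    return {
        name: span.value
        for name, spans in layers.items()
        for span in spans
        if span.start <= index < span.end
    }


def values_overlapping(span, layer):
    """Values in a layer whose spans overlap the given span."""
    return [s.value for s in layer if s.start < span.end and span.start < s.end]


abraham = layers["morph"][1]
print(annotations_at(2, layers))
print(values_overlapping(abraham, layers["line"]))  # the name crosses two lines
```

Adding a new kind of annotation means adding one more entry to `layers`; nothing already recorded has to be rewritten.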

Moreover, you want to automate as much of your annotation as possible. We have at least semi-automated normalizing spelling, which eliminates diacritics and supralinear strokes, normalizes spelling variants, deals with abbreviations, and so forth. Normalization is essential both for search and for further automated annotations. We’ve also semi-automated annotations for language of origin of words in a text, and we are developing a lemmatizer, which will match each word with its dictionary head word.
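A very reduced version of spelling normalization might look like the Python sketch below. It strips combining marks (supralinear strokes and other diacritics) using the standard `unicodedata` module and then consults a lookup table for abbreviations. The one-entry table and the function names are hypothetical, and our semi-automated normalization handles far more cases than this.

```python
import unicodedata

# Hypothetical lookup for abbreviations and spelling variants, keyed on the
# mark-stripped form. The single entry expands the two-letter abbreviation
# of the name Jesus to its full spelling.
EXPANSIONS = {
    "\u2c93\u2ca5": "\u2c93\u2c8f\u2ca5\u2c9f\u2ca9\u2ca5",
}


def strip_marks(text):
    """Remove nonspacing combining marks such as supralinear strokes."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize(word, expansions=EXPANSIONS):
    """Strip marks, then expand known abbreviations or variants."""
    stripped = strip_marks(word)
    return expansions.get(stripped, stripped)


def annotate(word):
    """Keep the original form and add the normalized form as another layer."""
    return {"orig": word, "norm": normalize(word)}


with_stroke = "\u2c99\u0304\u2c9b"            # two letters, the first with a stroke
abbreviation = "\u2c93\u0305\u2ca5\u0305"     # two letters, each with an overline
print(len(with_stroke), len(normalize(with_stroke)))
print(len(abbreviation), len(normalize(abbreviation)))
```

Because `annotate` stores the normalized form next to the original rather than replacing it, the diplomatic spelling remains available for anyone who needs it.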

Finally, we’ve developed a part of speech tagger, which is a natural language processing algorithm. It learns as it processes more data, based on patterns and probabilities. We have two sets of tags – coarse, which will just tag all nouns as nouns, for example – and fine – which will tag proper nouns, personal subject pronouns, personal object pronouns, etc.

And so now your data looks like this:

You’ve preserved all your information. By making everything annotations – even spelling normalization – you don’t “lose” information. You just annotate another layer.

So, what can you do with this?

1. Basic search for historical and philological research. Below is a screen shot of the search and visualization tool we are using, ANNIS. ANNIS was developed originally for computational linguistics work, and we are adapting it for our multidisciplinary endeavor. Here I’ve searched for both the terms God and Lord in Shenoute’s Abraham Our Father.

The query is on the upper left, the corpus I’ve selected is in the lower left, and the results are on the right.

You can export your results or select more than one text to search:

And if you click on a little plus sign next to “annotations” under any search result, you can see all the annotations for that result.

So, noute is a noun, it’s part of this bound group hitmpnoute, it’s in page 518 of manuscript YA, etc.

You can also read the text normalized or in the diplomatic edition of the manuscript inside ANNIS:
Or if you already know the texts you want to read, you can access them easily as stand-alone webpages on our main site (coptic.pacific.edu: see the HTML normalized and diplomatic pages of texts).

2. Linguistics and Style. Here, I’ve told ANNIS to give me all the combinations of three parts of speech and the frequencies those sequences occur:

This is known as a “tri-gram” – you’re looking for sequences of three things. I didn’t tell it any particular three parts of speech, I said, give me ALL sequences of three. And then I generated the frequencies. Note: everything I am presenting here is raw data, designed primarily to GENERATE and EXPLORE research questions, not to answer them in a statistically rigorous way. This is raw data.
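For anyone who wants to reproduce this kind of count outside ANNIS, here is a small Python sketch of the idea. It takes a sequence of part-of-speech tags, collects every run of three adjacent tags, and reports each tri-gram with its count and percentage. The tag names and the sample sequence are hypothetical, and this is not how ANNIS computes its frequency tables internally.

```python
from collections import Counter


def pos_trigram_frequencies(tags):
    """Return (trigram, count, percent) tuples, most frequent first."""
    trigrams = list(zip(tags, tags[1:], tags[2:]))
    total = len(trigrams)
    counts = Counter(trigrams)
    return [(tri, n, 100.0 * n / total) for tri, n in counts.most_common()]


# Hypothetical tag sequence: "in the house", a past-tense clause, "in the house".
sample_tags = ["PREP", "ART", "N", "PAST", "PRON", "V", "PREP", "ART", "N"]

for trigram, count, percent in pos_trigram_frequencies(sample_tags):
    print(" + ".join(trigram), count, f"{percent:.2f}%")
```

The percentages are relative to the number of tri-grams in the text, which is the sense in which the figures in this section should be read.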

What do we learn?

The most common combination of three grammatical categories is the preposition + article + noun (“in the house”) across ALL the corpora – this is #1. Not a surprise if you think about it.

Also, you’ll notice some distinct differences in genre: the second most common tri-gram in the Apophthegmata Patrum is past tense marker + personal subject pronoun + verb (3.66% of all combinations), which fits with the Sayings as a kind of narrative piece. Similarly, for Mark 1-6, the second most common tri-gram is past tense marker + personal subject pronoun + verb (4.03% of tri-grams). Compare that to Besa, where this combination is the 4th most common tri-gram (2.1% of tri-grams), or Shenoute, with .91% in A22 (14th most common tri-gram) and 1.52% in Abraham Our Father (also 4th most common). (My hunch is this tri-gram probably skews HIGH in Abraham compared to its frequency overall in Shenoute, since there are so many narrative references to biblical events in Abraham Our Father.)

By contrast, a marker for Shenoute’s style is the relative clause. Article + noun + relative converter occurs .91% of the time in Acephalous Work #22 and .76% in Abraham. But in Mark, it’s the 33rd most common combination, and occurs .55% of the time. In the Apophthegmata Patrum, it occurs .44% of the time (the 40th most common combination).

Some of you are probably thinking, “Wait a minute, what is this quantitative analysis telling me that I don’t already know? Of course narrative texts use the past tense! And Shenoute’s relative clauses have been giving me conniption fits for years!” But actually, having data confirm things we already know at this stage of the project is a good thing – it suggests that we might be on the right track. And then with a larger dataset and better statistics, we can next ask other questions about, say, authorship, and bilingualism or translation vs native speakers. For example: A) How much of the variation between Mark and the AP on the one hand, and Shenoute and Besa on the other can be explained by the fact that Mark and the AP are translations from the Greek? Can understanding this phenomenon – the syntax of a translated text – help us study other texts for which we only have a Coptic witness and resolve any of those “probably translated from the Greek” questions that arise about texts that survive only in Coptic? B) Shenoute is reported to have lived for over 100 years with a vast literary legacy that spans some eight decades. Did he really write everything attributed to him in those White Monastery codices? Can we use vocabulary frequency and style to attribute authorship to Coptic texts?

3. Language, loan words, and translation practices. We can also study loan words and translation practices. Quickly let’s take a look at the frequency of Greek loan words in the five sets of texts:

In Abraham Our Father, 4.71% of words are Greek; Mark 1-6: 6.33%; A22: 5.44%; Besa: 5.82%; AP: 4.25%. The texts are grouped on the graph roughly by the size of the corpus – Mark 1-6 is closer in size to Abraham Our Father, and the others are very small corpora. What’s interesting to me is the Apophthegmata Patrum number. Since it’s a translation text, I’d expect this figure to be higher, more like Mark 1-6.
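The percentages here are simple shares of all words in a text. As a minimal Python sketch, assuming each word carries a language-of-origin annotation (with `None` standing for native Coptic vocabulary), the calculation looks like this. The word list is a hypothetical toy sample in transliteration, not data from our corpora.

```python
from collections import Counter


def loanword_shares(words):
    """Percentage of all words per language of origin.

    words is a list of (form, origin) pairs; origin is None for native
    vocabulary and a language name for loan words.
    """
    total = len(words)
    if total == 0:
        return {}
    counts = Counter(origin for _, origin in words if origin is not None)
    return {origin: 100.0 * n / total for origin, n in counts.items()}


# Hypothetical toy sample.
sample = [
    ("hitm", None),
    ("p", None),
    ("noute", None),
    ("angelos", "Greek"),
    ("abraham", "Hebrew"),
]

print(loanword_shares(sample))
```

With corpora as small as some of ours, a handful of words can move these percentages noticeably, which is one more reason to treat the figures as exploratory.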

4. Scriptural references and other text reuse. Is it also possible to use vocabulary frequencies to find scriptural citations? The Tesserae project in Buffalo is working on algorithms to compare two texts in Latin or two texts in Greek to try to identify places where one text cites the other. Hopefully, we will be able to adapt this for Coptic one day.

In the Digital Humanities, “distant reading” has become a hot topic. Distant reading typically means mining “big data” (large data sets with lots and lots of texts) for patterns. Some humanists have bemoaned this practice as part of the technological takeover of literary studies, the abandonment of close reading in favor of quantitative analyses that don’t require you ever to actually read a text. Can distant reading also serve some very traditional research questions about biblical quotations, authorship identification, prosopography, or the evolution of a dialect?

This project still has a lot to do. We need to improve some of our taggers, create our lemmatizer, link our lemmas to a lexicon, provide universal references so that our texts, translations, and annotations can be cited, and possibly connect with other linked data projects about the ancient world (such as Pelagios and SNAPDRGN).

For today, I hope to have shown you the potential for such work, the need for at least some of us to dive into the matrix of technical data as willingly and as deeply as we dive into depths of theology and history. And also, I invite you to join us. If you have Coptic material you’d like to digitize, if you have suggestions, if you would like to translate or annotate a text we already have digitized, consider this an invitation. Thank you.

The Coptic SCRIPTORIUM project is pleased to announce that we have been awarded two grants from the National Endowment for the Humanities. A grant from the Office of Digital Humanities will support tools and technology for the study of Coptic language and literature in a digital and computational environment. A grant from the Division of Preservation and Access will support digitization of Coptic texts.

Dr. Caroline T. Schroeder Receives two National Endowment for the Humanities grants

Ann Mazzaferro, Apr 8, 2014

Google has transformed the way we seek knowledge, and most questions can be answered with, “There’s an app for that,” but there are still corners that no search engine or web application has yet reached, among them rare writings in a dead Egyptian language.

With $100,000 in new grants from the National Endowment for the Humanities, Caroline T. Schroeder, associate professor of Religious and Classical Studies at University of the Pacific, plans to change that. Working in collaboration with her project co-director, Amir Zeldes of Humboldt University in Berlin, Schroeder’s goal is to make Coptic accounts of monks battling demons in the desert, early theological controversies, and accounts of life in Egypt’s first Christian monasteries as easy to access online as the morning’s latest news.

“Dr. Schroeder is a distinguished scholar and spectacular teacher, and there is no one more deserving of this prestigious recognition,” said Dr. Rena Fraden, dean of College of the Pacific, the liberal arts and sciences college at University of the Pacific. “Nations – ancient and modern – will always be judged for their contributions to knowledge and the arts. Pacific and Carrie Schroeder belong to this glorious tradition.”

Schroeder received a $40,000 Humanities Collections and Reference Resources grant, which will enable scholars not only to digitize core Coptic texts housed at institutions around the world, but to develop standards for future digitization projects. She also received a $60,000 Digital Humanities Start-Up Grant; it will allow scholars to develop the tools and technologies necessary for computer-aided study and interaction with the materials.

The study of Coptic texts has gained attention in recent years, with high-profile controversies including the announcement in 2012 of an apparent Coptic papyrus text that may refer to “Jesus’ wife,” and increased international focus on the political climate of Egypt.

The digitization of these texts, and the database that Schroeder and her colleagues are working to create, will allow students, researchers, and non-academics alike to translate, analyze and understand the content of these Coptic texts, and to cross-reference the material with other texts and resources, including dictionaries and lexicons.

“This is the most cutting edge grant you can get for this type of work,” said Schroeder, who has taught at Pacific since 2007 and is the director of the Pacific Humanities Center. “This is about creating the technology for the study of the humanities. There aren’t that many technologies that work for Coptic or Egyptian texts. It’s an entire language family, and an important one for history, language, art. This is a world cultural heritage, a study of how our world and culture came to be.”

It will also create a centralized, open-source archive where these texts can be accessed in their entirety, anywhere in the world. This is particularly important, as many of these texts have been separated over centuries; reading one letter penned by a Coptic author may mean traveling to several different libraries and museums across the globe to track down the full account. While some Coptic manuscripts have been published in print, others have not.

“There are a lot of materials from this time and place that need more study. You have to know, if you want to read a letter, that some of the pages are going to be in London, some in Naples, some in Paris,” Schroeder said. “These documents and texts are primarily housed in Western museums and libraries, and our project is committed to being open-access, and to being available to everyone, including people in the country where these texts originated.”

This multi-disciplinary project has involved work with scholars from around the world, as well as collaboration among faculty and students at Pacific. Lauren McDermott, an English major with a Classics minor in the College, learned the Coptic alphabet in order to help proofread, digitize, and encode texts; Alexander Dickerson, a Computer Sciences major in the School of Engineering and Computer Sciences, worked on the coding as well.

About the National Endowment for the Humanities
Created in 1965 as an independent federal agency, the NEH supports research and learning in history, literature, philosophy, and other areas of the humanities by funding selected, peer-reviewed proposals from around the nation. For more information, visit www.neh.gov.

About University of the Pacific
Established in 1851 as the first university in California, University of the Pacific prepares students for professional and personal success through rigorous academics, small classes, and a supportive and engaging culture. Widely recognized as one of the most beautiful private university campuses in the West, the Stockton campus offers more than 80 undergraduate majors in arts and sciences, music, business, education, engineering and computer science, and pharmacy and health sciences. The university’s distinctive Northern California footprint also includes the acclaimed Arthur A. Dugoni School of Dentistry in San Francisco and the McGeorge School of Law in Sacramento. For more information, visit www.pacific.edu.

We’ve released several new corpora:
-two fragments of Shenoute’s Acephalous Work #22 (aka A22, from Canons Vol. 3)
-two letters of Besa (to Aphthonia and to Thieving Nuns)
-chapters 1-6 of the Sahidic Gospel of Mark (based on Warren Wells’ Sahidica New Testament)

These corpora include:
⁃ visualizations and annotations of diplomatic manuscript transcriptions (except for Mark)
⁃ visualizations and annotations of the normalized text
⁃ annotations of the English translation (except for some A22 material)
⁃ part-of-speech annotations (which can be searched)
⁃ search and visualization capabilities for normalized text, Coptic morphemes, and bound groups in most of the corpora
⁃ Language of origin annotations (Greek, Hebrew, Latin) in most corpora (which can be searched)
⁃ TEI XML files of the texts in the corpora, which validate to the EpiDoc subset

We’ve also:
⁃ Updated the documentation about our part-of-speech tag set and tagging script. (If you’re interested at all in Coptic linguistics please do read about our tag set)
⁃ Provided some example queries for our search and visualization tool (ANNIS); just click on a query and ANNIS will open and run it
⁃ updated our Frequently Asked Questions document
⁃ released an update to the Apophthegmata Patrum corpus to incorporate some of the new technologies described above
⁃ improved automation of normalizing text, annotating it for part-of-speech, annotating language of origin, annotating word segmentation (bound groups vs morphemes, etc.)

We would love to hear from you if you use our site; we think it will be useful for people teaching Coptic as well as conducting research. Please email either of us feedback directly.

The improvements in automation also mean we would love to work with you if you have digitized Coptic texts that you would like to be able to search or annotate, if there are texts you would like to digitize, or if you would like to annotate existing texts in our corpus in new ways. We are ready to scale up!

Thanks for all of your support. This project is designed for the use of the entire Coptological community, as well as folks in Linguistics, Classics, and related fields.

All the TEI files have been lightly annotated with linguistic annotations.

The metadata has been updated to provide more information about the repositories and manuscript fragments.

There are now TEI downloads for every file in our public ANNIS database.

All TEI files conform to the EpiDoc TEI XML subset and validate to the EpiDoc schema.

The files are licensed under a CC-BY 3.0 license which allows unrestricted reuse and remixing as long as the source is credited (Coptic SCRIPTORIUM). Linguistic annotations were made possible with the sharing of resources from Dr. Tito Orlandi and the CMCL (Corpus dei Manoscritti Copti Letterari); please credit them, as well.

We welcome your feedback on the TEI XML. We hope to release more texts in the corpora later this winter or in early spring.

Some of our most important biblical manuscripts and extra-canonical early Christian literature survive in the Coptic language. Coptic writers are also some of our most important sources for early scriptural quotation and exegesis. This presentation will introduce the prototype for a new online platform for digital and computational research in Coptic, and demonstrate its potential for the detection and analysis of “text-reuse” (quotations from, citations and re-workings of, and allusions to prior texts). The prototype platform will include tools for formatting digital Coptic text as well as a digital corpus of select texts (most specifically the writings of Shenoute of Atripe, who is known for both his biblical citations and his biblical style of writing). It will allow searching for patterns of shared vocabulary with biblical texts as well as for grammatical and syntactical information useful for stylistic analyses. Both the potential uses and limitations of implicit methodologies will be discussed.