I’ve published a piece on Rarity in the Digital Age in RBM: A Journal of Rare Books, Manuscripts, and Cultural Heritage. Because the journal has a very sensible copyright policy (authors retain their own copyrights), I can publish it here as well.

It should be noted that this is actually published here under the terms of the Cambridge UP contract, which, as closed-access contracts go, isn’t a bad one. It’s contracts like these that allow me to consider publishing with Cambridge in the future.

An older relative of mine recently asked me to explain the Text Encoding Initiative. I began, as I often do when attempting to explain text encoding to my elders, by making a comparison to the versification and text coloring in the many editions of the Christian Bible, but quickly realized that, for perhaps the first time, I could also safely draw analogies from the idea of eBook readers and the use of semantic tagging by corporations to achieve higher Google ranks. The more I explained, the more I realized this should, in fact, be the glory year of the TEI, but it certainly doesn’t feel that way. In January of 2011, Amazon reported that they sold more Kindle eBooks than paperbacks in 2010. The reading public is increasingly moving from paper to digital media, and the texts consumed on these devices will need to be prepared in a way that allows them to be easily disseminated to a number of devices that, at present, make use of a set of incompatible, occasionally proprietary, encoding formats. The Internet Archive, for instance, releases all of their books in 8 different ebook formats. Notably, they do not use TEI, nor do they provide tools to generate derivatives from texts that use our guidelines. Nor do I believe that any of the major ebook publishers use the guidelines as a base format for their publications.

2011 also is rapidly becoming, for Digital Humanists at least, the year of the semantic web. From the API workshop at the University of Maryland to the Linked-Open Data for Libraries and Museums conference at the Internet Archive, from the release of the microformats specification to the recent set of linked-open data awards given at the NEH Digital Humanities Startup Grant competition, semantic tagging has moved beyond the niche community of a few information architecture geeks to a central position in the scholarly conversation. Yet, despite the large number of eBook publishers and editors present at the linked open data summit in San Francisco, very few showed any interest in or even awareness of the TEI (despite the fact that, during at least one conversation, relevant TEI work on overlapping hierarchies was directly useful to the problem being discussed).

Finally, 2011 has also been a year when the potential of automatic processing of large corpora has become obvious to even non-digital humanists. Press coverage of the Google N-grams tool, the continuing work of Bamboo, and arguments over the value and legality of the Hathi Trust show the research questions that can be generated even by processing unstructured texts with unreliable metadata.

We are, in fact, at a moment when the TEI should be ascendant. The TEI community’s primary value proposition is that it provides a standardized vocabulary for describing text that is more expressive than the vocabularies of related schemas (such as HTML, LaTeX, or ePub) and that can be used for interchange, corpora processing, software-independent preservation, and, according to some, literary criticism itself. We remain, however, on the fringes of the metadata world and of the non-scholarly text encoding world, and have demonstrated only very moderate success using TEI for large-scale data processing.

The TEI is, I fear, in danger of becoming a Dvorak keyboard. When the Apple IIc was released, it included a switch, mysterious to many, labeled simply “Keyboard.” For those who didn’t read the manual, the switch’s function seemed to be to change between “gibberish” and “normal” mode. In reality, the switch toggled the Keyboard ROM between the familiar U.S. QWERTY standard and the supposedly more efficient arrangement, the Dvorak Simplified Keyboard. In 1936, educational psychologist August Dvorak patented a keyboard that, he claimed, arranged the keys in a way that, on modern keyboards, significantly improved efficiency. The QWERTY keyboard, he explained, was intentionally inefficient and separated letters that commonly occurred together in English to prevent adjacent levers from jamming against each other when struck in rapid succession. Better typewriter designs and, later, computers eliminated the problem of jamming keys, and so a more efficient layout would, it would seem, place commonly co-occurring characters in close proximity to each other. Of course, the Dvorak keyboard has never really caught on. By failing to achieve significant buy-in from the most important user communities, the supposedly superior method of text inscribing has become all but irrelevant save for a few niche user groups.

The standard narrative of the history of Dvorak claims that it failed for a set of reasons that may be summarized in the following 4 points:

The advantages offered by Dvorak were not sufficient to upset the ubiquity of QWERTY.

New technologies that extended the QWERTY standard made switching to a new one impractical.

Dvorak was inconsistently implemented (the number keys in many Dvorak keyboard mappings follow the QWERTY arrangement).

The economy of the 1930s did not encourage the development of implementations of the standard.

I fear that unless the TEI community and leadership undertake some dramatic and immediate changes, the TEI guidelines are in grave danger of suffering the same fate for many of the same reasons.

To begin, the advantages of TEI are not sufficient to upset the popularity of the text encoding standard with QWERTY-like ubiquity, HTML. I acknowledge that the analogy is in some ways imperfect. For one thing, TEI is, in fact, the older standard. Tim Berners-Lee first publicly released HTML in 1991; the TEI P1 guidelines were published in 1990. The two standards grew up together, but the wider applicability of HTML allowed it to catch on much more quickly, to the point that now TEI is the standard that must prove its worth against the more familiar HTML.

Today, HTML is one of the most commonly understood computer languages, and examples are readily accessible to anyone who learns how to “View source” in their web browser. Humanities students usually pick up the basics in less than an hour of instruction. The tags are very generic and limited, and so can be mastered quickly, but the latest versions of the language are also extensible enough to provide encoders most of the semantic power of TEI. TEI is, undoubtedly, more expressive, but the affordances of the additional power do not, for most, outweigh the learning curve and additional complexity of the schema.

A traditional argument against HTML for text encoding claims that its focus on presentational rather than descriptive markup limits its use cases too narrowly. This may have been convincing in 1996, but the position is simply no longer defensible by rational argument. Since HTML 4.0 the presentational elements have been all but entirely deprecated (the “i” and “b” tags for italics and bold have been replaced) in favor of descriptive tagging with CSS. The recent developments in microformats (semantic markup encoded in the attributes of standard HTML elements) promoted by Google and Bing through the schema.org structures signify the beginning of what will likely be increased use of HTML for semantic tagging.
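As a rough illustration of what semantic markup carried in the attributes of ordinary HTML elements might look like for TEI material, here is a minimal Python sketch. It is only a sketch under stated assumptions: the `tei-` class prefix, the `data-tei-` attribute prefix, and the choice of which elements become `div` rather than `span` are all hypothetical conventions invented for this example, not an established microformat or a schema.org vocabulary.

```python
import xml.etree.ElementTree as ET

# Assumption: these TEI element names are rendered as block-level <div>s;
# every other element becomes an inline <span>. The list is illustrative only.
BLOCK_ELEMENTS = {"p", "ab", "l", "lg", "sp", "stage"}


def local_name(tag):
    """Strip an XML namespace such as {http://www.tei-c.org/ns/1.0} from a name."""
    return tag.split("}", 1)[-1]


def tei_to_html(elem):
    """Recursively copy a TEI element into a generic HTML element.

    The TEI element name is kept in a hypothetical "tei-" class, and each
    TEI attribute is kept in a hypothetical "data-tei-" attribute.
    """
    name = local_name(elem.tag)
    html = ET.Element("div" if name in BLOCK_ELEMENTS else "span")
    html.set("class", "tei-" + name)
    for key, value in elem.attrib.items():
        html.set("data-tei-" + local_name(key), value)
    html.text = elem.text
    for child in elem:
        html_child = tei_to_html(child)
        html_child.tail = child.tail  # keep the text that follows the child
        html.append(html_child)
    return html


def convert(tei_xml):
    """Convert a TEI fragment (as a string) into an HTML string."""
    return ET.tostring(tei_to_html(ET.fromstring(tei_xml)), encoding="unicode")


SAMPLE = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">'
    '<persName ref="#dvorak">August Dvorak</persName> patented a keyboard in '
    '<date when="1936">1936</date>.</p>'
)

if __name__ == "__main__":
    print(convert(SAMPLE))
```

The output is plain HTML that any browser or search engine can already parse, while the original TEI element names and attributes remain recoverable from the `class` and `data-tei-` values.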

All of this points to another example of why TEI cannot win the hearts and minds of the general public or new digital humanists. There is too much invested in HTML as the de facto standard of text markup, and too many new technologies assume its use. Microformats are just one example. The most popular open e-book standard, ePub, uses XHTML as its base. The most common massive data processing tools, internet search engines, have spent years developing algorithms to parse and process HTML. Fringe standards, like the TEI, must work, with negligible resources, to map their format into ones usable by these tools.

Indeed, even for TEI-based tools, a conversion process is often necessary. TEI may be a standard format, but it is rarely applied in a standard way. We have too many fundamentally redundant tags. When does a block of text lose enough of “the semantic baggage of a paragraph” to switch from a “p” to an “ab”? If a verse line in a play has a stage direction on the same physical line in the source document, should an “lb” tag follow the stage direction? If so, why does it not follow the rest of the verse lines? Must a parser really check to see if there is an “implied” line break due to the verse line? These are common questions that arise for those simply trying to apply the canonical guidelines. For those who extend the TEI, even more inconsistencies emerge. Apologists point to the TEI-M standard that managed to wrangle diverse and inconsistently tagged TEI documents into a corpus for text processing, but the fact that this had to be done at all suggests the TEI is not really the best tool for assisting “distant reading.” If it is not, though, then it really should not be recommended by granting bodies for its affordances for interoperability.
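To make the stage-direction question concrete, here is a small Python sketch of the kind of check a parser ends up performing. It rests on assumptions of my own rather than on anything the guidelines mandate: that every “l” element implies one line break, that an “lb” appearing as the last child of an “l” with no text after it is redundant, and that the sample documents omit the TEI namespace for brevity. The function name and the sample encodings are hypothetical.

```python
import xml.etree.ElementTree as ET


def count_physical_lines(xml_text):
    """Count physical lines in a speech, reconciling explicit and implied breaks.

    Assumptions (illustrative, not taken from the TEI guidelines):
    - each <l> ends with one implied line break;
    - an <lb/> that is the last child of its <l>, with no text after it,
      duplicates that implied break and is ignored;
    - only <lb/> elements that are direct children of <l> are considered.
    """
    root = ET.fromstring(xml_text)
    count = 0
    for verse_line in root.iter("l"):
        count += 1  # the break implied by the verse line itself
        children = list(verse_line)
        for index, child in enumerate(children):
            if child.tag != "lb":
                continue
            is_last = index == len(children) - 1
            has_text_after = bool((child.tail or "").strip())
            if is_last and not has_text_after:
                continue  # redundant with the implied break
            count += 1  # a genuine break inside the verse line
    return count


# Two encoders transcribe the same two physical lines differently.
ENCODING_A = (
    "<sp><l>First verse line</l>"
    "<l>Second verse line <stage>Exit</stage></l></sp>"
)
ENCODING_B = (
    "<sp><l>First verse line</l>"
    "<l>Second verse line <stage>Exit</stage><lb/></l></sp>"
)
# One verse line that wraps onto a second physical line in the source.
ENCODING_C = "<sp><l>A long verse line<lb/>wrapped in the source</l></sp>"

if __name__ == "__main__":
    for label, text in [("A", ENCODING_A), ("B", ENCODING_B), ("C", ENCODING_C)]:
        print(label, count_physical_lines(text))
```

Both of the first two encodings count as two physical lines only because the function silently discards the trailing “lb”; a tool that simply counted tags would treat them as different documents.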

This inconsistency is in part due to, and in part the cause of, the paucity of tools that make use of TEI. By 1993 the HTML specification had a popular web browser. We do not yet have a tool comparable to Mosaic or Netscape for the use of TEI files. I understand the argument that any good standard should be independent of a particular, ephemeral tool, but I believe the disappearance of Mosaic and the emergence of four or five major browsers (not including the mobile versions) suggests that an interoperable standard actually depends upon the implementation of tools that test and prove it.

This is, I think, the future of the TEI, at least for the near term. The standard is working well enough for the very small subset of scholars and metadata specialists who are actually using it. In these days in which libraries have adopted a standard of “more product, less process,” a standard as process-intensive as TEI must have a product to justify its use. In a period in which e-Publishing is now clamoring for better and more functional interfaces for text, we need to show that our standard provides more functionality. I propose, then, a moratorium on funding for any discussion related to the TEI standard. New tags can wait. Indeed, fewer tags are probably what is needed, but for now, let’s leave that aside. Let us take the whole of our TEI vocabulary (the only thing we have that others don’t have at the moment) and import it into microformats, then build tools that can use it. Some of these tools might be built by specific institutions with project-specific grants, but I think we will actually go farther if we work together to build tools that belong to no one institution but are built by many. Interedition has already demonstrated the significant coding output that can be achieved simply by bringing programmers together in a room for a few days with only a few thousand dollars. If we were to use the TEI budget to fund these sorts of meetings rather than the committee-based writing that produces the TEI guidelines, if we were to allow institutions to pay their dues with programmer hours dedicated to a collaborative open source project over the course of a year, I suspect our value to the larger humanities enterprise, and perhaps even the general public, would be clearer. As it is, we are presently in danger of joining the Dvorak keyboard in the archive of irrelevant obscurities.

The TEI has not been altogether successful thus far, but it would be a shame if what we have accomplished were simply discarded because our pragmatic competitors convince the public and our funders that we are entirely useless.

Maybe it’s like when I’ve spilled coffee on my shirt, and I’m convinced that everyone is thinking “Why does Doug have a coffee stain on his shirt?” when no one actually noticed or cares. The coffee stain, in this instance, is the wordpress.com domain name you’re seeing in the address bar up there, and I’ll take the same approach I usually do with coffee stains: draw attention to the embarrassment first so that everyone knows I know. This way I can sit back and preen over how gracefully I demonstrated my own awareness of my flaws and everyone else can go back to not caring.

I’m a computer geek! I can host my own server on the Linux box I bought ten years ago (or at the very least rent some shared server space somewhere). Why am I using a free wordpress account with a default theme?

Simplicity.

And (here’s where I turn this little self-justification into philosophical pontification, watch and learn, dear reader) that’s often really the best reason for any choice where digital technology is involved. In programming projects it’s often tempting to try to give users all the control you have as a programmer, but most users, even expert users, don’t want that freedom. Unless we’re going to spend a lot of time with a piece of software as a central part of our research, we just want to do something quickly and easily. It’s great if it looks good. It’s nice if the functionality extends from simple to advanced control seamlessly (as many Microsoft and most Google apps do), but if a choice has to be made I’d rather have something simple and easy (with the option, like WordPress provides, to download the source code and get really advanced if I need to).

I recently wrote a chapter for Bethany Nowviskie’s upcoming book on alternate careers in academia (occasionally known by the Twitter hash tag #alt-ac). I describe a mythical hybrid that, as Napoleon Dynamite might say, “is pretty much my favorite animal.” Ancient writers of bestiaries discovered this creature in the Septuagint translation of the fourth chapter of the Biblical book of Job, which observes, in most English translations, “The old lion perisheth for lack of prey.” The translators of the Septuagint, though, were apparently confused by the Hebrew word used for lion and instead coined the idea of the “Μυρμηκολέων” or the “Ant Lion.” Imaginative proto-biologists assumed this must refer to an ant/lion hybrid that, having both herbivorous and carnivorous ancestry, could never find a meal to suit its nature. As a trained and degreed computer scientist with a Ph.D. in English, I find I often identify with this beast. Finding suitable employment in which I can fully satisfy both my programmer and humanities researcher impulses has been challenging, to the point that I have sometimes feared starving intellectually. I have, however, found the digital humanities and digital library communities to be particularly welcoming and populated with beasts similar to myself (this is the point of my chapter), but I also recognize that many ant-lions are roaming the wilderness still in search of these communities. I intend for this blog to be, among other things, a call across the wasteland of a Google search results page, beckoning others of my emerging species to come and discuss our strange hybrid scholarship.