About

The foxglove (digitale in French) is a flower which grows in difficult situations, such as the crevices of old walls or rocks. It is grown commercially for life-saving medicine, and yet is also a celebrated poison. In this blog, I will try unsuccessfully to avoid too many bad puns, but this convergence of digital(is) and digitalisation is fascinating, and a useful trope to launch my glorying in the linguistic and organisational consequences of the collision between the digital and the humanities in France, seen from a British perspective.

Month: June 2011

Here’s a nice way of spending a day in the heart of the Marais. Get together a bunch of people who do actually use the TEI (or some other kind of structured XML markup) to do cool things and ask them to talk for a maximum of 10 minutes each about the software they use and what they do with it. I claim no credit at all for this idea: the event was masterminded by Anais Wion, Fabrice Melka, and Denise Ogilvie, who just coincidentally have to prepare a workshop on the verb « exploiter » in Aussois later this year. Whatever its origins, this turned out to be a really worthwhile day, and not just because of the venue (the alabaster hall of the Archives Nationales) or the lunch (yum, Lebanese buffet).

A proper account of the proceedings has been promised for a couple of weeks hence, so this note is just the consequence of me jotting down some immediate impressions on the train home. There is already a useful page of links to stuff mentioned at the workshop at http://www.delicious.com/workshopexploiter, which I should probably update with this report.

I kicked off by explaining why the TEI really didn’t ought to have much to do with software production, except for its own nefarious purposes. I conceded, however, that those purposes led ineluctably to the production of Sebastian’s Excellent Stylesheets and hence to a generic software tool of some importance in the community. Marjorie Burghart then talked about the XML database eXist, showing it in action on her sermones.net site, and also her paleographic exercise site; the main problem with it, for her, was that its installation and maintenance on a local server require a little more technical expertise (for example, fine-tuning a Java environment, recovering Tomcat when it falls over, etc.) than is available to the typical humanities department. This need for infrastructural computing support turned out to be a major theme of the day. Next up was Lauranne Bertrand from the CESR team at Tours, who showed how they currently use XTF to display various versions of their richly encoded texts. Maud Ingaro then introduced us to a new XML database from the University of Konstanz called BaseX, which seems worth a second look, if only for its very sparkly visualisation features, though its main claim to fame is probably its ability to handle REALLY BIG (multi-gigabyte) databases, which (if true) should give several current pontificators pause for thought. Jorge Fins, also from CESR, then talked about Philologic, which provides traditional text searching (full text indexing, concordancing, etc.) capabilities, running on a distinct (and distinctly dumbed down) copy of the Bibliotheques Virtuelles des Humanistes exported to Chicago.

After a brief pause for coffee, Alexei Lavrentev, standing in for Serge Heiden (reportedly recently immobilised by a close encounter with a crampon), showed us the current state of txm, the open source text analysis system developed by the textometrie project at Lyon. Severine Gedzelman, also from Lyon, then described Hypermachiavel, an application for handling multiple aligned corpora (or, to be more exact, one specific set of multiple aligned corpora). I found the difference in software design between these two projects interesting: txm was developed very consciously as a generic text processing framework, incorporating and rationalising features from many other systems, whereas Hypermachiavel was developed (almost from zero) very much to meet the specific needs of a particular research project, without any particular generic intention.

Does the world need another generic tool for doing textual annotation in XML? Certainly many linguists and computer scientists seem to think so. Cue Antoine Widlocher from the University of Caen, and Glozz, a new platform for distributed linguistic annotations of text segments, overlapping or otherwise, relationships, graphs, etc. etc. Very nice visualisations, as per other Java applications; nice features such as annotation histories; no evidence that any researchers from the humanities had been involved in its design or application up to now. Florence Clavaud, from the Ecole Nationale des Chartes, then spoke very briefly (no really) about Pleade and her plans to enhance this mainstream EAD-muncher to include TEI capabilities. Pleade is one of the tools of choice in the French archival community, so enhancing it to handle TEI as well as it currently manages EAD and sets of digital images would be very cool. Also from ENC, Vincent Jolivet and Frederick Glorieux showed us diple, which is a nice simple package written in PHP to transform complex TEI markup to static web pages, with a complementary suite of stylesheets to render them, and something called xrem, a very glamorous tool for visualisation and construction of RELAX NG schemas. Fred likes to work directly in RELAX NG rather than via ODD, but the results almost justify such heresy. Nicole Dufournaud, aided and abetted by Denise Ogilvie, told the (possibly) instructive history of how Millefeuille (a nice customized TEI editing and indexing application based on work Nicole pioneered back in the nineties) is now in a state of suspended animation. Following one unsuccessful attempt at reanimation, it appears that another one is proposed as part of a European project.

Finally before lunch, Maud Ingaro showed us some CamStudio videos about dinah: this « philological platform for the construction of multi-structured documents » is currently being developed at Lyon in a project studying the manuscripts of Jean-Toussaint Desanti, and seems worth a second look, even though it’s a long way from being stable yet.

After the afore-mentioned very nice lunch, there was a wide-ranging free-form discussion, from which I took away chiefly the following points (as aforesaid, there will be a more complete and correct report later):

a general feeling that IT infrastructural support was lacking: in particular, people wanted some kind of sandpit environment in which they could experiment with different tools, and some easily accessible web-publishing service for e.g. doctoral students to showcase their work

a general feeling that development and implementation of XML-based projects was hard work requiring input from specialists, consequently a need for more training

a desire to share experience of these and other tools; the existence of TEI-FR, and the TEI Tools SIG were agreed to be appropriate channels.

Some pointed requests were made for the TGE to do more to provide some of these services, which proposal I agreed to go away and investigate.

The AGORA project (this one, not to be confused with this other one nor even this other one again) has defined a very simple TEI XML schema for scholarly publishing. In this series of blog entries, I report my attempts to process a set of documents which conform to that schema into PDF and other formats, using the TEI stylesheet library. My environment is a laptop running Ubuntu 10.04, on which I have installed the 5.1.4 release of the tei-xsl package and most of the texlive Ubuntu packages (versions dating from July 2009 according to dpkg).

On the train to London this morning, I wrote a Makefile which validates each file and, if valid, then processes it using the teitolatex and xelatex commands. This produced something not entirely discouraging, with the following obvious things to fix:

some of my files had numbered headings and others didn’t. By default the stylesheets added numbers willy nilly. I need to switch this behaviour off.

some of my files used <byline> in the header to indicate the affiliation for an author, like this:

By default, the stylesheets clearly have no idea what to do with the text fragment following the <docAuthor>, and therefore spit it out on a page of its own.
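
The validate-then-convert step that produced these first results can be sketched in shell rather than as a Makefile. This is only an illustration under stated assumptions: the schema filename agora.rng is hypothetical, xmllint is one possible validator (the post does not say which one was used), and build_one and build_all are made-up helper names. The teitolatex and xelatex calls follow the form used later in this post.

```shell
#!/bin/sh
# Sketch only: validate each TEI file, then convert the valid ones to PDF.
# SCHEMA is a hypothetical filename; xmllint is one possible validator.
SCHEMA="${SCHEMA:-agora.rng}"

# Validate one file and, only if it is valid, run teitolatex then xelatex.
build_one() {
    src="$1"
    tex="${src%.xml}.tex"
    if xmllint --noout --relaxng "$SCHEMA" "$src"; then
        teitolatex "$src" && xelatex "$tex"
    else
        echo "skipping $src: not valid against $SCHEMA" >&2
        return 1
    fi
}

# Process every file named on the command line, carrying on past failures.
build_all() {
    status=0
    for f in "$@"; do
        build_one "$f" || status=1
    done
    return "$status"
}

build_all "$@"
```

Saved as, say, build.sh, it could be run as sh build.sh *.xml. Invalid files are reported on standard error and skipped, which mirrors the "if valid, then process" rule of the Makefile.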

I learned at the excellent MUTEC workshop last week that the recommended way of modifying these stylesheets is to set up a new « profile », so I duly visited the directory /usr/share/xml/tei/stylesheet/profiles and created a new folder there called /usr/share/xml/tei/stylesheet/profiles/agora (somewhat to my surprise this did not require root access). I then copied the existing default specifications for each of the target transformations I thought I might use in my Agora work into this folder. Like this:
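
A minimal sketch of that copying step, under assumptions: it supposes the existing specifications live in a sibling folder called default, with one subdirectory per target transformation, and copy_profile is a made-up helper name rather than anything shipped with the stylesheets.

```shell
#!/bin/sh
# Sketch only: seed a new stylesheet profile from the default one.
# PROFILES is the directory named in the post; the "default" subdirectory
# as the source of the copies is an assumption.
PROFILES="${PROFILES:-/usr/share/xml/tei/stylesheet/profiles}"

# copy_profile NAME TARGET... : copy default/TARGET to NAME/TARGET.
copy_profile() {
    name="$1"
    shift
    mkdir -p "$PROFILES/$name" || return 1
    for target in "$@"; do
        cp -r "$PROFILES/default/$target" "$PROFILES/$name/" || return 1
    done
}
```

With that in place, copy_profile agora latex docx oo would populate the agora folder with one subdirectory per named target.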

The directory names (latex, docx, etc.) are not particularly well publicized: I worked out by inspection that « oo » must be the one invoked by the command « teitoodt »… presumably at some point it will be renamed Liboff vel sim.

Anyway, this setup should mean that if I now do e.g.

teitolatex --profile=agora foo.xml

I should get the same result as I would if I left out the --profile option … and so indeed I do. Good. Time to start messing about.

I take a peek into the contents of my agora/latex folder. It contains just one file, called to.xsl — which presumably controls the conversion from tei to latex. One day maybe some clever person will add a file called from.xsl which does the opposite. Or not.

The file is rather dull: all it does is remind me that the file is copyright TEI Consortium 2008, and that the library it invokes is « distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY ». Fair enough. It also loads the stylesheet at ../../../latex2/tei.xsl but all it does to modify that is set some mysterious parameter called reencode to false. So clearly I am at liberty to add further modifications in this file… or will be once I have changed permissions on the file.

../../../latex2 (i.e. /usr/share/xml/tei/stylesheet/latex2, sibling of the profiles directory) is the directory with the real biz. It contains files named for most TEI modules, as well as promising looking files like tei-param.xsl. A little sniffing around, and I have discovered the XSL template for processing the TEI <head> element inside the file core.xsl, which contains the following magic:

That looks to me suspiciously like there should be a parameter called numberHeadings which I should set to false in order to suppress those pesky generated section numbers. (Of course, I’d have found that out immediately if I’d bothered to read the documentation, but …)
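
One quick way to test a hunch like this without opening the documentation is to search the stylesheet tree for the parameter's declaration. The following is a sketch under assumptions: find_param is a made-up helper name, and the pattern only catches declarations written with name as the first attribute of xsl:param.

```shell
#!/bin/sh
# Sketch only: list where a stylesheet parameter is declared.
# find_param is a hypothetical helper; the default directory is the
# stylesheet root mentioned in the post.
find_param() {
    param="$1"
    dir="${2:-/usr/share/xml/tei/stylesheet}"
    # Print file:line:text for every <xsl:param name="..."> declaration.
    grep -rn "<xsl:param name=\"$param\"" "$dir"
}
```

Calling find_param numberHeadings would print each matching file and line, and grep's exit status of 1 signals that no such declaration was found.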

Back in my file profiles/agora/latex/to.xsl, I add the following line

<xsl:param name="numberHeadings">false</xsl:param>

and then regenerate the PDF, using the tweaked stylesheet in my agora profile:

teitolatex --profile=agora aaberge_2007.xml
xelatex aaberge_2007.tex

Bingo! no numbering. This could maybe be easier than it looks…

My second problem is trickier. The challenge and the delight of the TEI is precisely its open-endedness, and so it often happens that something which looks plausible in TEI has no obvious translation in some other markup system, such as LaTeX. In my case, how *should* the <byline> element be processed? A grep through the LaTeX directory shows me that at present there is no template at all for it, so my hands are comparatively untied. My first thought is just to add a template like the following to my file:

on the assumption that the bit of text inside the byline elements might as well be treated in the same way as an author as anything else. But LaTeX is not so liberal: when it finds that I have generated

I therefore copy the whole of the <xsl:template> for docAuthor into my to.xsl file, add the above clause, and blow me down it (nearly) works. I had, of course, forgotten to suppress a second appearance of those pesky text fragments caused by the default processing for <byline>. One more template:

<xsl:template match="tei:byline/text()"/>

fixes that.

Of course, the more I look at this, the less I like it. A much better solution would be to tag the affiliation data as such in the XML source, using an element such as <affiliation> perhaps, and then process it correctly into whatever LaTeX provides for the treatment of such things. But that would, as aforesaid, require some research into what LaTeX can do, as well as changing the Agora schema.

Not a bad way to pass the train journey to Paris, especially when surrounded by kids returning home after the half term hols.