Formats and standards for humanities information

About Naomi Truan

I am a PhD Candidate in Linguistics at Sorbonne Université and the Freie Universität Berlin (cotutelle de thèse).
More information on my academic background and my publications in English, German, and French on the following page: http://cmb.hu-berlin.de/fr/lequipe/profil/naomi-truan/.
I share my thoughts in English and French on my academic blog: https://icietla.hypotheses.org/.

Representations of the Other in the British, French and German Discourse on Europe: A Corpus-Based Contrastive Discursive Analysis

Naomi Truan (Université Paris-Sorbonne / Freie Universität Berlin)

The purpose of this article is to present a successful strategy for importing the corpus on which my PhD project is based into the open-source software TXM. If you do not know TXM yet, click on the link! This software, developed at the Ecole Normale Supérieure (Lyon, France), is open source, regularly updated, and offers a wide range of textometric queries (among them cooccurrences and concordances, and many more).

The corpus is delivered with the TEI-XML files and two Style Sheets for the import and for the conversion/visualisation as an HTML file (like new shampoos: 2-in-1!). The complete version of this document (with figures and two appendixes) is also available on the ORTOLANG server as a PDF document in my workspaces: French, English, German.

As a student in German Studies (focusing mainly on Literature, Philosophy, History, and Translation), I had no prior experience of the huge (but magnificent) world of Computational Linguistics. I discovered it through my PhD in Corpus Linguistics and thanks to the help of Laurent Romary and the TXM team (especially Serge Heiden, Alexey Lavrentev, and Bénédicte Pincemin). Thus, this article is not only here to show you how it works, but also to tell you: Yes, you can! Even with no background in Computational Linguistics, you can acquire and develop your own strategies to make sense of these weird languages computer engineers work with.

My PhD thesis, entitled “Representations of the Other in the British, French and German Discourse on Europe: A Corpus-Based Contrastive Discursive Analysis”, relies on a qualitative and quantitative linguistic analysis of parliamentary debates in three European countries. You can find the corpora and their accurate description on ORTOLANG: French, English, German.

The corpus has been manually annotated according to the TEI Guidelines. If you wish to see how the corpus was annotated, please consult the document entitled “Corpus Annotation” on the ORTOLANG server (links above).

I – Import of the Corpus into TXM

Put all the TEI-XML files into one folder and add a file named import.properties (with no additional extension such as .doc or .pdf) containing the line ignoredelements=note|bibl. This way, the corpus statistics in TXM will ignore the content of the TEI-XML tags <note> and <bibl>.
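If you prefer not to create the file by hand, the step above can be sketched in a few lines of Python. This is only an illustration: the function name write_import_properties and the folder name in the usage note are hypothetical, and the only thing taken from the procedure itself is the file name import.properties and its content ignoredelements=note|bibl.

```python
from pathlib import Path


def write_import_properties(source_dir, ignored=("note", "bibl")):
    """Create import.properties in source_dir and return its path.

    The file holds one line, e.g. ignoredelements=note|bibl.
    The function name and signature are illustrative, not part of TXM.
    """
    folder = Path(source_dir)
    # Create the source folder if it does not exist yet.
    folder.mkdir(parents=True, exist_ok=True)
    properties = folder / "import.properties"
    # Element names are joined with "|" as in the line quoted above.
    line = "ignoredelements=" + "|".join(ignored) + "\n"
    properties.write_text(line, encoding="utf-8")
    return properties
```

Calling write_import_properties("corpus_uk"), where corpus_uk stands for whatever folder holds your TEI-XML files, writes the file next to them.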

Then click on “Sélectionner le répertoire des sources” and select the corresponding folder. Please note that the import.xml file is created during the import, so it will not yet be in the folder at your first import. If you modify the TEI-XML files, the import.xml file will be recreated at each import.

In “Dossier des sources”, you can add information on the corpus. If you use the corpus I annotated for your own research project, I kindly ask you to refer to it in this section, for instance with the following mention: Naomi Truan 2016 – CC BY 4.0.

The “Police d’affichage” depends on your personal taste and does not affect the import at all.

In the section “Langue principale”, do not forget to select “en” for English, “de” for German, or “fr” for French if you wish to have the corpus morphosyntactically annotated (part-of-speech tags and lemmas) with TreeTagger (the tutorial for installing TreeTagger on TXM is here).

The “Paramètres du segmenteur lexical” do not need to be changed.

For the “Feuille XSL d’entrée”, please use the XSL Style Sheet provided along with the TEI-XML files, freely adapted from txm-filter-teip5-xmlw-preserve.xsl.

“Editions” and “Commandes” do not need to be changed.

The import of the corpus can now begin. Once it is complete, you can view the metadata of the corpus by clicking on the information icon.

Please note that the first “Propriétés des unités lexicales” (body, desc, incident, quote, seg) are not reliable; the given numbers do not correspond to anything in the corpus.

On the text level and on the utterance level, though, the information is fully accurate, so that the following metadata enable correct partitions of the corpus according to these variables: date, government, id, party, party-type, position, role, sex, who-party, who-party-type, who-position, who-role, who-sex.
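To make the idea of a partition concrete, here is a minimal Python sketch, independent of TXM, that counts elements according to the value of one metadata attribute. The element name u, the placement of who-sex and who-party as attributes on it, and the sample text are assumptions made for illustration only; please check them against the actual TEI-XML files before relying on the result.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip a namespace prefix such as {http://www.tei-c.org/ns/1.0}."""
    return tag.rsplit("}", 1)[-1]


def partition_counts(xml_text, element, attribute):
    """Count elements named `element`, grouped by the value of `attribute`."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for node in root.iter():
        if local_name(node.tag) == element and attribute in node.attrib:
            counts[node.attrib[attribute]] += 1
    return counts


# Invented sample, NOT taken from the corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<u who-sex="F" who-party="Party X">First turn.</u>
<u who-sex="M" who-party="Party Y">Second turn.</u>
<u who-sex="F" who-party="Party X">Third turn.</u>
</body></text></TEI>"""

by_sex = partition_counts(SAMPLE, "u", "who-sex")
by_party = partition_counts(SAMPLE, "u", "who-party")
```

TXM builds such partitions for you and then lets you query each part; the sketch only shows what "splitting the corpus according to a variable" amounts to.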

II – HTML Visualisation of the Corpus with the XSL Style Sheet

I will now comment on the XSL Style Sheet, which can be used for the import into TXM (see Part I), but also to visualise the corpus as a whole in HTML format. On the ORTOLANG server, you can find it under Content > UK TEI-XML Files, or here.

In the XSL Style Sheet, information shown in green such as <!-- Corpus of British Parliamentary Debates --> is a comment: it has no effect on the XSL Style Sheet and simply provides information to guide the reader.

If you open the XSL Style Sheet and the XML Style Sheet together in oXygen XML Editor and click, within the XML Style Sheet, on the red button on your right, oXygen XML Editor will automatically run the Style Sheet and open the corpus in HTML format in your browser (like a webpage).

Alternatively, if you have not made any changes to the corpus, just double-click on the “HTML file UK – Style sheet (2)”, which will open in your browser as well. The procedure described above is necessary only if you wish to encode other tags or to visualise them differently (for instance, if you wish to see the <quote> tags in orange rather than in red) or if you add new tags to the corpus (for instance, if you notice a missing <quote> tag in one of the TEI-XML files of the corpus).

You can then scroll down through the corpus. This allows a quick search (for instance with Ctrl+F, for people not familiar with TXM, which offers many more queries in this regard) and a quick visualisation (for instance, if seeing an utterance on a webpage and counting its lines gives you a better impression of its length than counting text units).

The corpus begins with general information, such as: Number of Incidents, Number of Turns, Number of Speakers, Number of Opposition Members, Number of Majority Members. In this regard, I strongly advise relying on the statistics provided by TXM rather than on those of the XSL Style Sheet, which are sometimes misleading. For instance, the Style Sheet counts seven Plaid Cymru Members but lists only four names, which is inconsistent (there are, in fact, four Plaid Cymru Members).
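A figure like this can be cross-checked outside the Style Sheet by counting distinct speaker names per party and comparing the result with the number of turns. The Python sketch below does exactly that on invented data. Whether the Style Sheet's figure of seven actually comes from counting turns or repeated entries instead of distinct speakers is only an assumption, not something established here; the function names and the sample pairs are hypothetical.

```python
from collections import defaultdict


def members_per_party(turns):
    """Return {party: sorted list of distinct speaker names}."""
    members = defaultdict(set)
    for speaker, party in turns:
        members[party].add(speaker)  # a set keeps each name only once
    return {party: sorted(names) for party, names in members.items()}


def turns_per_party(turns):
    """Return {party: number of turns}, counting repeated speakers each time."""
    counts = defaultdict(int)
    for _speaker, party in turns:
        counts[party] += 1
    return dict(counts)


# Invented (speaker, party) pairs, NOT taken from the corpus.
TURNS = [
    ("Speaker A", "Party X"),
    ("Speaker B", "Party X"),
    ("Speaker A", "Party X"),
    ("Speaker C", "Party Y"),
]

distinct = members_per_party(TURNS)
per_turn = turns_per_party(TURNS)
```

In this toy example, Party X has two distinct members but three turns, which is the kind of gap worth looking for when a reported number of members and a list of names disagree.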

That's it! Normally, every time you adjust the corpus (correct a typo, make some minor changes, rename a speaker, etc.), the HTML version will follow, enabling you to visualise the latest version of the corpus very quickly and to search through it. At the same time, you can re-import the corpus into TXM by following the previous steps. If you do not rename the corpus, TXM will automatically ask you whether you wish to replace the existing corpus. By clicking “yes”, you will update the corpus.

You can now see how to visualise (i.e. make nice!) your corpus through an XSL Style Sheet and an XML Style Sheet especially designed for the purposes of your own research. The Style Sheets can be adapted for every type of corpus following the TEI Guidelines. Thus, they should not be seen as a model, but rather as an example suitable for TXM.

Please feel free to contact me with any question regarding TXM, TEI, XML and HTML formats, but also Corpus Linguistics, Cognitive Linguistics, or Political Discourse! I cannot promise that I will be able to answer every technical concern, but I will try my best (and can also forward your question(s) to people who know more than I do): Naomi.Truan@paris-sorbonne.fr. Any comment will be much appreciated!