
Setting bounds in a homogeneous corpus

We inaugurate here a new category dedicated to sources and research documents. In it we will mostly publish scripts, databases and documents related to articles we have published. By making our sources available, we aim to let anyone who wishes reproduce our analyses, thereby increasing the verifiability and openness of our research and encouraging interaction with our readers.

In this case, we provide materials related to an article entitled « Setting bounds in a homogeneous corpus: a methodological study applied to medieval literature », which was presented at last year's MASHS conference.

We offer here the databases used, a memo of the R commands, the list of repeated segments found in the texts, and some of the graphs produced during the analysis phase.

Abstract of the paper

The authors present here an exploratory and non-specific method that requires no a priori assumptions about the data, nor any heavy transformation such as lemmatisation, and that is to be understood as a first step in the apprehension of a corpus. After a first phase of calibration, based on a control sample, they introduce a method whose heuristic value is to bring out, at different levels, internal divisions of different kinds (diachronic, diatopic, related to authorship or scribes, …) that can then be analysed specifically. They illustrate this method by applying it to a corpus of Occitan medieval texts, the vidas, whose authors and origins are in good part unknown.

Sources

Databases

The first database contains the control sample, a frequency table of 12 troubadours’ cansos, in diplomatic transcription.

The second database contains a frequency table for the vidas of manuscript A.

R commands

The document contains a simple and only partially documented list of the R commands that were used, as well as an R function created by the authors for weighting the texts. To compensate specifically for the different lengths of the texts, we adopted a very basic weighting: let Fij be the frequency of term i in document j and Tj the total number of terms in document j; the weighted value is Fij/Tj. This was allowed by the relationship between the number of word forms and the total number of words: the length of the texts would still be reflected in their lexical richness, without being the most important element in all subsequent analyses.
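The weighting described above amounts to dividing each term count by the total number of terms in its document. The following is a minimal sketch of that idea in Python; it is not the authors' R function, and the function name `relative_frequencies` and the toy counts are hypothetical, chosen only for illustration.

```python
def relative_frequencies(counts):
    """Return Fij / Tj for every term i in every document j.

    `counts` maps a document name to a dict {term: raw frequency}.
    This layout is an assumption; the published databases may be
    organised differently.
    """
    weighted = {}
    for doc, term_counts in counts.items():
        total = sum(term_counts.values())  # Tj: total number of terms in document j
        if total == 0:
            # An empty document has no terms to weight.
            weighted[doc] = {}
            continue
        weighted[doc] = {term: f / total for term, f in term_counts.items()}
    return weighted


# Hypothetical toy counts, not taken from the published databases.
toy_counts = {
    "text_a": {"e": 3, "fo": 1},
    "text_b": {"e": 1, "fo": 1, "de": 2},
}
weighted = relative_frequencies(toy_counts)
print(weighted["text_a"])
```

After weighting, the values of each document sum to 1, so texts of different lengths become directly comparable.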

Repeated Segments

For a presentation of the concept of repeated segments (RS) and their use, see A. Salem and P. Lafon (1983), L’inventaire des segments répétés d’un texte, Mots 6 (1), 161–177. The repeated segments were computed using the Lexico3 software, developed by the SYLED–CLA2T of University Paris 3 Sorbonne–Nouvelle. Shorter segments systematically contained in a longer one (that is, always preceded and followed by the same two word-forms) were eliminated by the software.
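To make the elimination step more concrete, here is a rough sketch in Python of one possible reading of it: a repeated segment is dropped when a segment one word longer contains it and occurs exactly as often. This is an assumption made for illustration, not a description of how Lexico3 works internally; the function name, its parameters and the toy token list are all hypothetical.

```python
from collections import Counter


def repeated_segments(tokens, min_len=2, max_len=5, min_freq=2):
    """Count repeated word sequences and drop those absorbed by a longer one."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1

    # Keep only the segments that occur at least `min_freq` times.
    repeated = {seg: f for seg, f in counts.items() if f >= min_freq}

    kept = {}
    for seg, f in repeated.items():
        # Assumed criterion: a segment is absorbed when a segment one word
        # longer starts or ends with it and has the same frequency.
        absorbed = any(
            len(longer) == len(seg) + 1
            and f_longer == f
            and (longer[:-1] == seg or longer[1:] == seg)
            for longer, f_longer in repeated.items()
        )
        if not absorbed:
            kept[seg] = f
    return kept


# Hypothetical toy sequence of word-forms, not taken from the corpus.
toy_tokens = "a b c x a b c y b c".split()
segments = repeated_segments(toy_tokens)
print(segments)
```

In this toy sequence, "a b" is dropped because it only ever occurs inside "a b c", whereas "b c" is kept because it also occurs once on its own.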