Tuesday, May 23, 2017

A test case for phylogenetic methods and stemmatics: the Divine Comedy

In a previous post I gave an outline of stemmatics, and briefly touched on the adoption and advantages of phylogenetic methods for textual criticism (On stemmatics and phylogenetic methods). Here I present the results of an empirical investigation I have been conducting, in which such methods are used to study some philological dilemmas of a cornerstone work in textual criticism, Dante Alighieri's Divine Comedy. I am reproducing parts of the text and the results of a paper still under review; the NEXUS file for this research is available on GitHub.

Before describing the analysis, I discuss the work and its tradition, as well as some of the open questions concerning its textual criticism. This should not only allow the main audience of this blog to understand (and perhaps question) my work, but it is also a way to familiarize you with the kind of research conducted in stemmatics. After all, the first step is the recensio, a deep review of all information that can be gathered about a work.

The Divine Comedy

The Divine Comedy is an Italian medieval poem, and one of the most successful and influential medieval works. It is written in a rigid structure that, when compared to other works, guaranteed it a certain resistance to copy errors, as most changes would be immediately evident. It is composed of three canticas (Inferno, Purgatory, and Paradise) with 100 cantos in total; the first cantos were written in 1306-07, and the work was completed not long before the death of the author in 1321. Written mostly during Dante's exile from his home city, Florence (Tuscany), like many works of the time it was published as the author wrote it, and not only upon completion. In fact, it is even possible, though not proven, that the author changed some cantos and published revisions, thus being himself the source of unresolvable differences.

No original manuscript has survived, but scholarship has traced the development of the tradition from copies and historical research. The poem is one of the most copied works of the Middle Ages, with more than 600 known complete copies, besides 200 partial and fragmentary witnesses. For comparison, there are around 80 copies of Chaucer's Canterbury Tales, which is itself a successful work by medieval standards.

Commercial enterprises soon developed to meet the market demand created by its success. In terms of geographical diffusion, quantitative data suggest that, before the Black Death that ravaged the city of Florence in 1348, scribal activity was more intense in Tuscany than in Northern Italy, where the author had died. Among the hypotheses for its textual evolution, the results of my investigation support the widespread hypothesis that Dante published his work with Florentine orthography in Northern Italy. That is, the first copies adopted Northern orthographic standards, which would then revert to Tuscan customs, with occasional misinterpretations, when the work found its way back to Florence. These essentials of the transmission must be considered when curating a critical edition, as the less numerous Northern manuscripts, albeit with an adapted orthography, can in general be assumed to be closer to the archetype (if there ever was one to speak of) than Florentine ones.

The tradition is characterized by intentional contamination, as the work soon became a focus of politics and grammatical prescriptivism. Errors and contamination have already been demonstrated in the earliest securely dated manuscript, the Landiano of 1336 (cf. Shaw, 2011), and can already be identified in the first commentaries dating from the 1320s (such as the one by Jacopo Alighieri, the author's son).

Critical studies

Here are some details about previous studies. I have included considerable stemmatic information, together with a biological analogy to help non-experts make sense of it.

The first critical editions date from the 19th century, but a stemmatic approach would only be advanced at the end of that century, by Michele Barbi. Facing the problem of applying Lachmann's method to a long text with a massive tradition, in 1891 Barbi proposed his list of around 400 loci (samples of the text), inviting scholars to contribute the readings in the manuscripts they had access to. His project, which intended to establish a complete genealogy without the need for a full collatio, had disappointing results, with only a handful of responses. Mario Casella would later (1921) conduct the first formal stemmatic study on the poem, grouping some older manuscripts in two families, α and β, of unequal number of witnesses but equal value for the emendatio. His two families are not rooted at a higher level, but he observed that they share errors supporting the hypothesis of a common ancestor, likely copied by a Northern scribe.

Casella's stemma, reproduced from Shaw (2011).

Forty years later, Giorgio Petrocchi proposed to overcome the problem of the large stemma by employing only witnesses dating from before the editorial activity of Giovanni Boccaccio, as his alterations and influence were considered to be too pervasive. Petrocchi defended a cut-off date of 1355 as being necessary for a stemmatic approach that would otherwise have been impossible, given the level of contamination of later copies. The restriction in the number of witnesses was offset by his expansion of the collatio to the entire text, as he criticized Barbi's loci as subjective selections for which there was no proof of sufficiency.

Making use of analogies with biology, we may say that Barbi proposed to establish a tree from a reduced number of "proteins" for all possible "taxa". Casella considered this to be impracticable and, selecting a few representative "fossils", built a tree from a large number of phenotypic characteristics. Finally, Petrocchi produced a network while considering the entire "genome" for all "fossils" dated from before an event that, while well-supported in theory (we could compare its effects to a profound climate change), was nonetheless arbitrary.
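To make the analogy a little more concrete, here is a minimal Python sketch of how readings collected at a list of loci could be turned into the kind of character matrix that phylogenetic software expects. The function name `encode_loci`, the sigla W1 to W3, and the readings are all invented for illustration; they are not taken from Barbi's list or from my data.

```python
from typing import Dict, List


def encode_loci(readings: Dict[str, List[str]]) -> Dict[str, List[int]]:
    """Turn variant readings per locus into integer character states.

    `readings` maps a witness siglum to its reading at each locus, in the
    same order for every witness. States are numbered per locus in order
    of first appearance, so identical readings share the same state.
    """
    n_loci = len(next(iter(readings.values())))
    state_ids = [dict() for _ in range(n_loci)]  # one lookup table per locus
    matrix = {}
    for siglum, row in readings.items():
        coded = []
        for locus, reading in enumerate(row):
            ids = state_ids[locus]
            # A reading not seen before at this locus gets the next free state.
            coded.append(ids.setdefault(reading, len(ids)))
        matrix[siglum] = coded
    return matrix


# Hypothetical witnesses and readings, for illustration only.
toy_readings = {
    "W1": ["amor", "vita", "selva"],
    "W2": ["amor", "via", "selva"],
    "W3": ["amore", "via", "silva"],
}
matrix = encode_loci(toy_readings)
for siglum, states in matrix.items():
    print(siglum, states)
```

In this picture each locus is one "character" and each witness one "taxon"; sampling loci shortens the rows, while restricting the witnesses removes rows.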

Petrocchi's stemma, reproduced from Shaw (2011).

Questions about Petrocchi's methodology and assumptions were soon raised, particularly regarding the proclaimed influence of Boccaccio, as there was no quantitative proof either that his editions were as influential as asserted or that all later witnesses were superfluous for stemmatics. Later research focused on questioning his stemma: for example, the absence of consensus about the relationship between the Ash and Ham manuscripts, the supposedly weak demonstration of the polytomy of Mad, Rb, and Urb (the "Northern manuscripts"), and the dating of Gv (likely copied fifty to a hundred years after Petrocchi's assumption). Evidence was presented that Co, a key manuscript in his stemma, could not be an ancestor of Lau (its copyist was still active in the 15th century), and that Ga contained disjunctive errors not found in its supposed descendants. Abusing the biological analogy once more, the dating of his "fossils" was in some cases plainly wrong.

Federico Sanguineti presented an alternative stemma in 2001, arguing that a rigorous application of stemmatics would reveal errors in Petrocchi. To that end, he decided to resurrect Barbi's loci and trace the first complete genealogy, without arbitrary and a priori decisions about the usefulness of the textual witnesses. Sanguineti defended the suggestion that, after this proper recensio, a small number of manuscripts (which he eventually set to seven) would be sufficient for emendation. His stemma, described as "optimistic in its elegance and minimalism" (Shaw 2011), resulted in a critical edition that relied heavily on a single manuscript, Urb, the only witness of his β family (as Rb was displaced from the proximity it had in Petrocchi's stemma, and Mad was excluded from the analysis). Keeping with the biological analogy, he proposed building a tree from an extremely reduced number of "proteins", but for all "taxa". In the end, however, the reduced number of "proteins" was considered only for seven "taxa", selected mostly due to their age.

Sanguineti's stemma, reproduced from Shaw (2011).

The edition of Sanguineti was attacked by critics, who objected to the limited number of manuscripts used in the emendatio, the position of Rb, the high value attributed to LauSC, and the unparalleled importance of Urb, all resulting in an unexpected Northern coloring to the language of a Florentine writer. Regarding his methodology, reviewers pointed out that stemmatic principles had not been followed strictly, as the elimination was not restricted to descripti, but extended to branches that were considered to be too contaminated.

The digital edition of Prue Shaw (2011) was developed as a project for phylogenetic testing of Sanguineti's assumptions. Her edition includes complete manuscript transcriptions, and the transcriptions include all of the layers of revision of each manuscript (original readings and corrections by later hands), and are complemented by high-quality reproductions of the manuscripts. After testing the validity of Sanguineti's method and stemma, Shaw concluded that his claims do not "stand up to close scrutiny", and that the entire edition is compromised, because Rb "is shown unequivocally to be a collaterale of Urb, and not a member of α as [Sanguineti] maintains".

Applying phylogenetic methods

With the goal of following and, to a large extent, replicating Shaw (2011), I have analyzed signals of phylogenetic proximity for validating stemmatic hypotheses, produced both a computer-generated and a computer-assisted phylogeny (equivalent to a stemma), and evaluated the performance of such phylogenies with methods of ancestral state reconstruction.

I wanted to investigate the proximity of witnesses and the statistical support for the published stemmas. After experiments with rooted graphs, I decided to use NeighborNets, in which splits are indicative of observed divergences and edge lengths are proportional to the observed differences. These unrooted split networks were preferable because they facilitated visual investigation, and also provided results for the subsequent steps. These involved exploring the topology and evaluating potential contaminations, guiding the elimination of taxa whose data would be redundant for establishing prior hypotheses on genealogical relationships. Analyses were conducted using all manuscript layers and critical editions, both with and without bootstrapping, thus obtaining results supported in terms of inferred trees as well as of character data.
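For readers who have not worked with distance-based networks, the following Python sketch shows the two ingredients that a split network of this kind is computed from: a pairwise distance matrix over the witnesses and bootstrap replicates of the character data. It is only an illustration under stated assumptions: I use a simple normalized Hamming distance here (this post does not specify the distance measure), the witnesses W1 to W3 and their states are invented, and the NeighborNet computation itself is left to SplitsTree.

```python
import random
from itertools import combinations


def hamming(a, b):
    """Proportion of compared characters at which two witnesses differ.

    Positions where either witness has missing data (None) are skipped.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(matrix):
    """Pairwise distances, keyed by (siglum, siglum) in sorted order."""
    return {
        (s, t): hamming(matrix[s], matrix[t])
        for s, t in combinations(sorted(matrix), 2)
    }


def bootstrap(matrix, rng):
    """One bootstrap replicate: resample character columns with replacement."""
    n_chars = len(next(iter(matrix.values())))
    columns = [rng.randrange(n_chars) for _ in range(n_chars)]
    return {s: [row[c] for c in columns] for s, row in matrix.items()}


# Hypothetical character states, for illustration only (None = lacuna).
toy_matrix = {
    "W1": [0, 0, 0, 1],
    "W2": [0, 1, 0, 1],
    "W3": [1, 1, 1, None],
}
dists = distance_matrix(toy_matrix)
replicate = bootstrap(toy_matrix, random.Random(0))
```

Recomputing the distances for many such replicates indicates how much a given split depends on a handful of characters, which is the intuition behind the bootstrapped analyses mentioned above.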

NeighborNet of the manuscripts and revisions from my data, generated with SplitsTree (Huson & Bryant 2006)

The analysis confirmed most of the conclusions of Shaw (2011). There are no doubts about the proximity and distinctiveness of Ash and Ham, with Sanguineti's hypothesis (in which they are collaterals) better supported than Petrocchi's hypothesis (in which the first is an ancestor of the second). The proximity of Mart and Triv was confirmed, but the position of the ancestors postulated by Petrocchi and Sanguineti should be questioned in the face of the signals they share with LauSC, perhaps because of contamination. The most important finding, in line with Shaw and in contrast with the fundamental assumption of Sanguineti, is the clear demonstration of the relationship between Rb and Urb.

The relationship analyses allowed the generation of trees for further evaluation. Although the goal had been a full Bayesian tree inference, I discarded that option because, without a careful and demanding selection of priors, it would yield flawed results. As such, I decided to build trees using both stochastic inference and user design (i.e., manually). This postponed more complex topology analyses for future research, but generated the structures needed by the subsequent investigation steps; both trees are included in the data file.

The second tree (shown below), which allows polytomies and which I constructed manually, tries to combine the findings of Petrocchi and Sanguineti by resolving their differences with the support of the relationship analyses. Using Petrocchi's edition as a gold standard, and considering only single-hypothesis reconstructions, parsimonious ancestral state reconstruction agrees with 9,016 characters (79.9%). When considering multiple hypotheses instead, reconstructions agree with 10,226 characters (90.7%). Cases of disagreement were manually analyzed and, as expected, most resulted from readings supported by the tradition but rejected by Petrocchi on exegetic grounds.
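As a rough illustration of what such a comparison involves, here is a minimal Python sketch of Fitch-style parsimony reconstruction at the root of a tree, followed by a count of agreements with a reference text. It rests on several assumptions of mine: the tree is strictly binary (a tree with polytomies needs a generalization of this procedure), "single hypothesis" is read as an unambiguous root reconstruction and "multiple hypotheses" as a root set that contains the reference reading, and the tree, witnesses, and states are invented.

```python
def fitch_root_states(tree, leaf_state):
    """Bottom-up Fitch pass returning the most-parsimonious root state set.

    `tree` is either a leaf name (str) or a pair (left, right) of subtrees;
    `leaf_state` maps each leaf name to its observed state for one character.
    """
    if isinstance(tree, str):
        return {leaf_state[tree]}
    left, right = (fitch_root_states(sub, leaf_state) for sub in tree)
    common = left & right
    # Fitch's rule: intersection if it is non-empty, otherwise union.
    return common if common else left | right


def agreement(tree, columns, reference):
    """Count characters whose root reconstruction matches the reference.

    Returns (single, multiple): `single` counts unambiguous root sets equal
    to the reference state, `multiple` counts root sets that contain it.
    """
    single = multiple = 0
    for column, ref in zip(columns, reference):
        states = fitch_root_states(tree, column)
        if ref in states:
            multiple += 1
            if len(states) == 1:
                single += 1
    return single, multiple


# Hypothetical binary tree and character columns, for illustration only.
toy_tree = (("W1", "W2"), ("W3", "W4"))
toy_columns = [
    {"W1": 0, "W2": 0, "W3": 0, "W4": 1},  # root set {0}
    {"W1": 0, "W2": 0, "W3": 1, "W4": 1},  # root set {0, 1}
    {"W1": 1, "W2": 1, "W3": 1, "W4": 0},  # root set {1}
]
toy_reference = [0, 1, 0]  # stand-in for the readings of a reference edition
single, multiple = agreement(toy_tree, toy_columns, toy_reference)
print(single, multiple)  # 1 2
```

Dividing each count by the number of characters gives percentages of the kind reported above; the third toy character is a disagreement of the sort that then has to be inspected by hand.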

This tree suggests that, in general, Petrocchi's network is better supported than the tree by Sanguineti, as phylogenetic principles lead us to expect: the first was built considering statistical properties and using all of the available data, while the second relied on many intuitions and hypotheses that were never really tested. In particular, it supports the findings of Shaw and, as such, allows us to indicate the critical edition of Petrocchi as the best one. Even more important, however, it is further evidence of the usefulness of phylogenetic methods, when appropriately used, in stemmatics.