Tuesday, May 23, 2017

A test case for phylogenetic methods and stemmatics: the Divine Comedy

In a previous post I gave an outline of stemmatics, and briefly touched on the adoption and advantages of phylogenetic methods for textual criticism (On stemmatics and phylogenetic methods). Here I present the results of an empirical investigation I have been conducting, in which such methods are used to study some philological dilemmas of a cornerstone work in textual criticism, Dante Alighieri's Divine Comedy. I am reproducing parts of the text and the results of a paper still under review; the NEXUS file for this research is available on GitHub.

Before describing the analysis, I discuss the work and its tradition, as well as some of the open questions concerning its textual criticism. This should not only allow the main audience of this blog to understand (and perhaps question) my work, but it is also a way to familiarize you with the kind of research conducted in stemmatics. After all, the first step is the recensio, a deep review of all information that can be gathered about a work.

The Divine Comedy

The Divine Comedy is an Italian medieval poem, and one of the most successful and influential medieval works. It is written in a rigid structure that, when compared to other works, guaranteed it a certain resistance to copy errors, as most changes would be immediately evident. Composed of three canticas (Inferno, Purgatory, and Paradise), the first of its 100 cantos were written in 1306-07, with the work completed not long before the death of the author in 1321. Written mostly during Dante's exile from his home city, Florence (Tuscany), like many works of the time it was published as the author wrote it, and not only upon completion. In fact, it is even possible, while not proven, that the author changed some cantos and published revisions, thus being himself the source of unresolvable differences.

No original manuscript has survived, but scholarship has traced the development of the tradition from copies and historical research. The poem is one of the most copied works of the Middle Ages, with more than 600 known complete copies, besides 200 partial and fragmentary witnesses. For comparison, there are around 80 copies of Chaucer's Canterbury Tales, which is itself a successful work by medieval standards.

Commercial enterprises soon developed to meet the market demand created by its success. In terms of geographical diffusion, quantitative data suggest that, before the Black Death that ravaged the city of Florence in 1348, scribal activity was more intense in Tuscany than in Northern Italy, where the author had died. Among the hypotheses for its textual evolution, the results of my investigation support the widespread hypothesis that Dante published his work with Florentine orthography in Northern Italy. That is, the first copies adopted Northern orthographic standards, which would then revert to Tuscan customs, with occasional misinterpretations, when the work found its way back to Florence. These essentials of the transmission must be considered when curating a critical edition, as the less numerous Northern manuscripts, albeit with an adapted orthography, can in general be assumed to be closer to the archetype (if there ever was one to speak of) than Florentine ones.

The tradition is characterized by intentional contamination, as the work soon became a focus of politics and grammatical prescriptivism. Errors and contamination have already been demonstrated in the earliest securely dated manuscript, the Landiano of 1336 (cf. Shaw, 2011), and can already be identified in the first commentaries dating from the 1320s (such as in the one by Jacopo Alighieri, the author's son).

Critical studies

Here are some details about previous studies. I have included a considerable amount of stemmatic information, along with a biological analogy to help non-experts make sense of it.

The first critical editions date from the 19th century, but a stemmatic approach would only be advanced at the end of that century, by Michele Barbi. Facing the problem of applying Lachmann's method to a long text with a massive tradition, in 1891 Barbi proposed his list of around 400 loci (samples of the text), inviting scholars to contribute the readings in the manuscripts they had access to. His project, which intended to establish a complete genealogy without the need for a full collatio, had disappointing results, with only a handful of responses. Mario Casella would later (1921) conduct the first formal stemmatic study on the poem, grouping some older manuscripts in two families, α and β, of unequal number of witnesses but equal value for the emendatio. His two families are not rooted at a higher level, but he observed that they share errors supporting the hypothesis of a common ancestor, likely copied by a Northern scribe.

Casella's stemma, reproduced from Shaw (2011).

Forty years later, Giorgio Petrocchi proposed to overcome the large stemma by employing only witnesses dating from before the editorial activity of Giovanni Boccaccio, as his alterations and influence were considered to be too pervasive. Petrocchi defended a cut-off date of 1355 as being necessary for a stemmatic approach that would otherwise have been impossible, given the level of contamination of later copies. The restriction in the number of witnesses was contrasted by his expansion of the collatio to the entire text, criticizing Barbi's loci as subjective selections for which there was no proof of sufficiency.

Making use of analogies with biology, we may say that Barbi proposed to establish a tree from a reduced number of "proteins" for all possible "taxa". Casella considered this to be impracticable and, selecting a few representative "fossils", built a tree from a large number of phenotypic characteristics. Finally, Petrocchi produced a network while considering the entire "genome" for all "fossils" dated from before an event that, while well-supported in theory (we could compare its effects to a profound climate change), was nonetheless arbitrary.

Petrocchi's stemma, reproduced from Shaw (2011).

Questions about Petrocchi's methodology and assumptions were soon raised, particularly regarding the proclaimed influence of Boccaccio, without quantitative proofs either that his editions were as influential as asserted or that all later witnesses were superfluous for stemmatics. Later research focused on questioning his stemma, pointing for example to the absence of consensus about the relationship between the Ash and Ham manuscripts, the supposedly weak demonstration of the polytomy of Mad, Rb, and Urb (the "Northern manuscripts"), and the dating of Gv (likely copied fifty to a hundred years after Petrocchi's assumption). Evidence was presented that Co, a key manuscript in his stemma, could not be an ancestor of Lau (its copyist was still active in the 15th century), and that Ga contained disjunctive errors not found in its supposed descendants. Abusing the biological analogy once more, the dating of his "fossils" was in some cases plainly wrong.

Federico Sanguineti presented an alternative stemma in 2001, arguing that a rigorous application of stemmatics would reveal errors in Petrocchi. To that end, he decided to resurrect Barbi's loci and trace the first complete genealogy, without arbitrary and a priori decisions about the usefulness of the textual witnesses. Sanguineti defended the suggestion that, after this proper recensio, a small number of manuscripts (which he eventually set to seven) would be sufficient for emendation. His stemma, described as "optimistic in its elegance and minimalism" (Shaw 2011), resulted in a critical edition that relied heavily on a single manuscript, Urb, the only witness of his β family (as Rb was displaced from the proximity it had in Petrocchi's stemma, and Mad was excluded from the analysis). Keeping with the biological analogy, he proposed building a tree from an extremely reduced number of "proteins", but for all "taxa". In the end, however, the reduced number of "proteins" was considered only for seven "taxa", selected mostly due to their age.

Sanguineti's stemma, reproduced from Shaw (2011).

The edition of Sanguineti was attacked by critics, who objected to the limited number of manuscripts used in the emendatio, the position of Rb, the high value attributed to LauSC, and the unparalleled importance of Urb, all resulting in an unexpected Northern coloring to the language of a Florentine writer. Regarding his methodology, reviewers pointed out that stemmatic principles had not been followed strictly, as the elimination was not restricted to descripti, but extended to branches that were considered to be too contaminated.

The digital edition of Prue Shaw (2011) was developed as a project for phylogenetic testing of Sanguineti's assumptions. Her edition includes complete manuscript transcriptions, and the transcriptions include all of the layers of revision of each manuscript (original readings and corrections by later hands), and are complemented by high-quality reproductions of the manuscripts. After testing the validity of Sanguineti's method and stemma, Shaw concluded that his claims do not "stand up to close scrutiny", and that the entire edition is compromised, because Rb "is shown unequivocally to be a collaterale of Urb, and not a member of α as [Sanguineti] maintains".

Applying phylogenetic methods

With the goal of following and, to a large part, replicating Shaw (2011), I have analyzed signals of phylogenetic proximity for validating stemmatic hypotheses, produced both a computer-generated and a computer-assisted phylogeny (equivalent to a stemma), and evaluated the performance of such phylogenies with methods of ancestral state reconstruction.

I wanted to investigate the proximity of witnesses and the statistical support for the published stemmas. After experiments with rooted graphs, I made a decision to use NeighborNets, in which splits are indicative of observed divergences and edge lengths are proportional to the observed differences. These unrooted split networks were preferable because they facilitated visual investigation, and also provided results for the subsequent steps. These involved exploring the topology and evaluating potential contaminations, guiding the elimination of taxa whose data would be redundant for establishing prior hypotheses on genealogical relationships. Analyses were conducted using all manuscript layers and critical editions, both with and without bootstrapping, thus obtaining results supported in terms of inferred trees as well as of character data.

NeighborNet of the manuscripts and revisions from my data, generated with SplitsTree
(Huson & Bryant 2006)
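As a rough illustration of the kind of input such a network is built from, here is a minimal Python sketch that computes pairwise distances between witnesses and a simple bootstrap over loci. It is not the NeighborNet algorithm itself (that step was done in SplitsTree), and everything in it is an assumption made for the example: the readings are invented, the sigla are borrowed only as labels, and the function names and the "closer to each other than to anyone else" criterion are hypothetical.

```python
from itertools import combinations
import random

# Hypothetical toy data: invented readings of four witnesses at six loci.
# The sigla are used only as labels; these are NOT the real collation data.
READINGS = {
    "Urb":  "aabaab",
    "Rb":   "aabaaa",
    "Mart": "bbabba",
    "Triv": "bbabaa",
}

def hamming(a, b, sites=None):
    """Proportion of the compared sites at which two witnesses differ."""
    sites = list(range(len(a))) if sites is None else list(sites)
    return sum(a[i] != b[i] for i in sites) / len(sites)

def distance_matrix(readings, sites=None):
    """Pairwise distances, keyed by the unordered pair of witness labels."""
    return {
        frozenset((x, y)): hamming(readings[x], readings[y], sites)
        for x, y in combinations(sorted(readings), 2)
    }

def pair_support(readings, a, b, n_reps=200, seed=0):
    """Share of bootstrap replicates (loci resampled with replacement) in
    which a and b are strictly closer to each other than to any other witness."""
    rng = random.Random(seed)
    n_sites = len(readings[a])
    others = [w for w in readings if w not in (a, b)]
    hits = 0
    for _ in range(n_reps):
        sites = [rng.randrange(n_sites) for _ in range(n_sites)]
        d_ab = hamming(readings[a], readings[b], sites)
        if all(d_ab < hamming(readings[x], readings[w], sites)
               for x in (a, b) for w in others):
            hits += 1
    return hits / n_reps
```

Calling `distance_matrix(READINGS)` gives the matrix that a tool such as SplitsTree would take as input, and `pair_support(READINGS, "Rb", "Urb")` gives a crude indication of how stable one pairing is under resampling of the loci.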

The analysis confirmed most of the conclusions of Shaw (2011): there are no doubts about the proximity and distinctiveness of Ash and Ham, with Sanguineti's hypothesis (in which they are collaterals) better supported than Petrocchi's hypothesis (in which the first is an ancestor of the second). The proximity of Mart and Triv was confirmed, but the position of the ancestors postulated by Petrocchi and Sanguineti should be questioned in the face of the signals they share with LauSC, perhaps because of contamination. The most important finding, in line with Shaw and in contrast with the fundamental assumption of Sanguineti, is the clear demonstration of the relationship between Rb and Urb.

The relationship analyses allowed the generation of trees for further evaluation. Although my initial goal was a full Bayesian tree inference, I discarded that option because, without a careful and demanding selection of priors, it would yield flawed results. As such, I decided to build trees using both stochastic inference and user design (i.e. manually). This postponed more complex topology analyses for future research, but generated the structures needed by the subsequent investigation steps; both trees are included in the datafile.

The second tree (shown below), which allows polytomies and which I constructed manually, tries to combine the findings of Petrocchi and Sanguineti by resolving their differences with the support of the relationship analyses. Using Petrocchi's edition as a gold standard, and considering only single hypothesis reconstructions, parsimonious ancestral state reconstruction agrees with 9,016 characters (79.9%). When considering multiple hypotheses, instead, reconstructions agree with 10,226 characters (90.7%). Cases of disagreement were manually analyzed and, as expected, most resulted from readings supported by the tradition but refuted by Petrocchi on exegetic grounds.

My proposed tree for the manuscripts selected by Sanguineti,
generated with PhyD3 (Kreft et al., 2017).
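To make the idea of the evaluation concrete, here is a minimal Python sketch of a parsimony reconstruction at the root of a tree that may contain polytomies, followed by a comparison with a reference text. It rests on several assumptions of mine: the toy tree, the readings and the "gold" string are invented; the counting rule is a Fitch-style bottom-up pass (it reduces to Fitch's intersection/union rule on binary nodes), not necessarily the procedure or software used for the results above; and I read "single hypothesis" as a uniquely most parsimonious root state and "multiple hypotheses" as the reference reading being among the tied states.

```python
from collections import Counter

# Hypothetical toy tree as nested tuples; leaves are witness labels and a
# tuple with more than two elements is a polytomy. Illustrative topology
# only, NOT the tree proposed in this post.
TREE = (("Urb", "Rb"), ("Mart", "Triv", "Ash"))

# Invented readings at three loci (one character per locus).
READINGS = {
    "Urb": "aab", "Rb": "abb", "Mart": "bab", "Triv": "bbb", "Ash": "bba",
}
GOLD = "aba"  # stand-in for the readings of a reference edition

def root_states(node, site):
    """Most parsimonious states at `node` for one site (bottom-up pass).

    Each child votes for every state in its own optimal set; the node keeps
    the states with the most votes, since each non-voting child costs one
    extra change on its branch.
    """
    if isinstance(node, str):
        return {READINGS[node][site]}
    votes = Counter()
    for child in node:
        votes.update(root_states(child, site))
    best = max(votes.values())
    return {state for state, count in votes.items() if count == best}

def agreement(tree, gold):
    """Count sites where the root reconstruction matches the reference:
    uniquely (single hypothesis) or as one of several tied states (multiple)."""
    single = multiple = 0
    for site, reading in enumerate(gold):
        states = root_states(tree, site)
        if states == {reading}:
            single += 1
        if reading in states:
            multiple += 1
    return single, multiple

single, multiple = agreement(TREE, GOLD)
print(f"single: {single}/{len(GOLD)}, multiple: {multiple}/{len(GOLD)}")
```

In this toy case the first locus is tied between two readings, the second is reconstructed unambiguously, and the third disagrees with the reference, which mirrors on a tiny scale the three situations distinguished in the figures above.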

This tree suggests that, in general, Petrocchi's network is better supported than the tree by Sanguineti, as phylogenetic principles lead us to expect: the first was built considering statistical properties and using all of the available data, while the second relied on many intuitions and hypotheses that were never really tested. In particular, it supports the findings of Shaw and, as such, allows us to indicate the critical edition of Petrocchi as the best one. Even more important, however, it is further evidence of the usefulness of phylogenetic methods, when appropriately used, in stemmatics.


Alagherii, Dantis (2001) Comedìa. Edited by Federico Sanguineti. Firenze: Edizioni del Galluzzo.

Alighieri, Dante (1994) La Commedia Secondo L’antica Vulgata: Introduzione. Edited by Giorgio Petrocchi. Opere di Dante Alighieri v. 1. Firenze: Le Lettere.

Huson, Daniel H.; Bryant, David (2006) Application of phylogenetic networks in evolutionary studies. Molecular Biology and Evolution 23: 254–267.

Inglese, Giorgio (2007) Inferno, Revisione del testo e commento. Roma: Carocci.

Kreft, Lukasz; Botzki, Alexander; Coppens, Frederik; Vandepoele, Klaas; Van Bel, Michiel (2017) PhyD3: a Phylogenetic Tree Viewer with Extended PhyloXML Support for Functional Genomics Data Visualization. BioRxiv. Doi: 10.1101/107276.

Leonardi, Anna M.C. (1991) Introduzione. In: La Divina Commedia, by Dante Alighieri. Milano: Arnoldo Mondadori Editore.

Shaw, Prue (2011) Commedia: a Digital Edition. Birmingham: Scholarly Digital Editions.

Trovato, Paolo (2016) Metodologia editoriale per la Commedia di Dante Alighieri. Ferrara. https://www.youtube.com/watch?v=BfKUOAR9PXA. Date of access: March 19, 2017.