Tuesday, April 18, 2017

Multimedia phylogeny?


Evolutionary concepts have often been transferred to other fields of study, or derived independently in them, especially in anthropology in the broadest sense, covering all cultural products of the human mind. This includes phylogenetic studies of languages, texts, tales, artifacts, and so on — you will find many examples of such studies in this blog. One of the more recent applications has been to what is sometimes called multimedia phylogeny — the research field that "studies the problem of discovering phylogenetic dependencies in digital media".

I have noted before that phylogenetics in the biological sense is only an analogy when applied to other fields, because only in biology is genetic information physically transferred between generations. In the other fields, cultural information transfer is all in the minds of the people, not in their genes (see False analogies between anthropology and biology). The analogy often becomes problematic in practice, because applying bioinformatics techniques outside biology separates the informatics from the bio, and the mathematical analyses then focus on implementing the informatics without any biological justification.


A recent paper that discusses the application of bioinformatics to multimedia phylogeny exemplifies the potential problems:
Guilherme D Marmerola, Marina A Oikawa, Zanoni Dias, Siome Goldenstein, Anderson Rocha (2016) On the reconstruction of text phylogeny trees: evaluation and analysis of textual relationships. PLoS One 11(12): e0167822.
The authors described their background information thus:
"Articles on news portals and collaborative platforms (such as Wikipedia), source code, posts on social networks, and even scientific publications or literary works, are some examples in which textual content can be subject to changes in an evolutionary process. In this scenario, given a set of near-duplicate documents, it is worthwhile to find which one is the original and the history of changes that created the whole set. Such functionality would have immediate applications on news tracking services, detection of plagiarism, textual criticism, and copyright enforcement, for instance.
"However, this is not an easy task, as textual features pointing to the documents' evolutionary direction may not be evident and are often dataset dependent. Moreover, side information, such as time stamps, are neither always available nor reliable. In this paper, we propose a framework for reliably reconstructing text phylogeny trees, and seamlessly exploring new approaches on a wide range of scenarios of text reusage. We employ and evaluate distinct combinations of dissimilarity measures and reconstruction strategies within the proposed framework."
So, their solution to the separation of bio from informatics is to try a range of techniques, none of which is based on any particular model of how phylogenetic changes might occur in text documents. All of these methods involve distance-based tree-building.

The essential problem, as I see it, is that without a model of change there is no reliable way to separate phylogenetic information from any other type of information. For example, similarity can arise from many sources, only some of which provide information about phylogenetic history; phylogenetic similarity is a form of "special similarity". In biology, the other sources of similarity, such as convergence and parallelism, are usually lumped together as chance similarities. Without this basic separation of phylogenetic and chance similarity, it does not matter how many distance measures you use, or how many tree-building methods you employ: if you can't separate phylogeny from chance then you are wasting your time constructing a hypothetical evolutionary history.

The authors' only saving grace is their claim that: "In text phylogeny, unlike stemmatology [the analysis of hand-written rather than digital texts], the fundamental aim is to find the relationships among near-duplicate text documents through the analysis of their transformations over time." The expectation, then, is that the phylogenetic similarity of the texts will be high, which will thus reduce the possibility of chance similarities. Sadly, it will also reduce the probability that the similarities contain any phylogenetic information at all, because near-duplicate documents are separated by very few changes. This is the classic short-branches-are-hard-to-reconstruct problem in phylogenetics.

For digital texts, the authors employ three distance measures: edit distance, normalized compression distance, and cosine similarity. None of these is model-based in any phylogenetic sense (although the first one is used in alignment programs such as Clustal); I have discussed this in the post on Non-model distances in phylogenetics. Their tree-building methods include: parsimony, support vector machines (a machine-learning form of classification), and random forests (a decision-tree form of classification). Once again, none of these is model-based in terms of textual changes.
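
To make the three measures concrete, here is a minimal Python sketch of how each one might be computed for a pair of documents. It is not the authors' implementation: the function names, the use of zlib as the compressor, and the whitespace tokenization for the cosine measure are all my own assumptions, and the paper's exact settings may differ.

```python
import math
import zlib
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete a character of a
                curr[j - 1] + 1,           # insert a character of b
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def ncd(a: str, b: str) -> float:
    """Normalized compression distance; zlib is an assumed compressor."""
    ca = len(zlib.compress(a.encode("utf-8")))
    cb = len(zlib.compress(b.encode("utf-8")))
    cab = len(zlib.compress((a + b).encode("utf-8")))
    return (cab - min(ca, cb)) / max(ca, cb)


def cosine_distance(a: str, b: str) -> float:
    """One minus the cosine similarity of word-count vectors.

    Splitting on whitespace after lower-casing is an assumption.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return 1.0 - dot / norm if norm else 1.0


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    copy = "the quick brown fox leaps over the lazy dog"
    print(edit_distance(original, copy))
    print(ncd(original, copy))
    print(cosine_distance(original, copy))
```

Each function reduces a pair of documents to a single number summarizing their overall difference. The edit and cosine measures are also symmetric in their two arguments, so neither says anything by itself about which document was copied from which.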

A final issue is the insistence on trees as the model of a phylogeny. In stemmatology, for example, a network is a more obvious phylogenetic model, because hand-written texts can be copied from multiple sources. Indeed, this distinction plays an important role in the first application of phylogenetics to stemmatology (see the post on An outline history of phylogenetic trees and networks). Perhaps this is not an issue for "near-duplicate text documents", but it does seem like an unnecessary restriction. Moreover, one of the empirical examples used in the paper actually has a network history, which therefore does not match the authors' reconstructed tree.
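
The difference between a tree and a network history is easy to state in code. In the sketch below, a copying history is recorded as a mapping from each document to the documents it was copied from; the document names and the history itself are invented for illustration and are not taken from the paper. A history is a tree only if no document has more than one direct source.

```python
from typing import Dict, List


def reticulate_documents(sources: Dict[str, List[str]]) -> List[str]:
    """Return the documents that were copied from more than one source."""
    return sorted(doc for doc, srcs in sources.items() if len(set(srcs)) > 1)


def is_tree_history(sources: Dict[str, List[str]]) -> bool:
    """True if every document has at most one direct source."""
    return not reticulate_documents(sources)


if __name__ == "__main__":
    # Hypothetical history: D combines material from both B and C.
    history = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
    print(reticulate_documents(history))
    print(is_tree_history(history))
```

Any tree reconstructed for such a history has to discard one of D's two sources, which is the mismatch described above.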