Phylogenetic inference

Image by Greg McInerny and Stefanie Posavec – textual shifts between editions of Darwin’s Origin of Species (used by kind permission of the artist – see bottom of post for further details).

In advance of starting up some big experiments on the texts being made available by TCP, we’ve been discussing the models developed in mathematics/biology for tracing influence.

This began with a conversation with David Krakauer from the Santa Fe Institute about our work. He works in mathematics and evolutionary biology, and has collaborated a bit with Franco Moretti. We told him about our attempts to group texts by genre and then trace their linguistic predecessors and descendants. He suggested this was similar to the problem of phylogenetic inference in biology.

The problem as we currently understand it: we identify a group of texts within a population based on manifest traits known to humans; we then want to account for the development of these traits among items understood to be earlier in the sequence; we link these traits to sentence level linguistic items; we track the traits via these items.
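A toy sketch of this formulation, and not the method used in the project: each text is reduced to rates of a few sentence-level linguistic features, and every text is linked to the earlier text it most resembles. The titles, dates, feature names and numbers below are all invented for illustration, and "nearest earlier text by Euclidean distance" is an assumed stand-in for a proper inference technique.

```python
from math import sqrt

# Hypothetical data: (title, year, feature rates per 1,000 words).
# Every title, date, feature name and value here is made up.
TEXTS = [
    ("Text A", 1590, {"first_person": 12.0, "questions": 3.0, "past_tense": 20.0}),
    ("Text B", 1600, {"first_person": 11.0, "questions": 4.0, "past_tense": 21.0}),
    ("Text C", 1605, {"first_person": 2.0, "questions": 9.0, "past_tense": 30.0}),
    ("Text D", 1620, {"first_person": 3.0, "questions": 8.0, "past_tense": 29.0}),
    ("Text E", 1625, {"first_person": 12.0, "questions": 3.0, "past_tense": 21.0}),
]


def distance(a, b):
    """Euclidean distance between two feature-rate dictionaries."""
    keys = set(a) | set(b)
    return sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def nearest_predecessors(texts):
    """Map each title to the strictly earlier title with the closest profile.

    The earliest text has no candidate predecessor and maps to None.
    """
    links = {}
    for title, year, feats in texts:
        earlier = [(t, f) for t, y, f in texts if y < year]
        if not earlier:
            links[title] = None
            continue
        best_title, _ = min(earlier, key=lambda tf: distance(feats, tf[1]))
        links[title] = best_title
    return links


if __name__ == "__main__":
    for title, predecessor in nearest_predecessors(TEXTS).items():
        print(f"{title} <- {predecessor}")
```

With this invented data, Text E links back to Text A rather than to the text immediately before it in time, which is the kind of link a human reader would then want to inspect and explain rather than accept from the algorithm.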

We are thinking about this now because it will be one of our big intellectual problems once we add time to the analysis (so far, we’ve been looking at populations, e.g. Shakespeare’s plays, that sit in pretty much the same time slot).

Here are some starter references (though they are certainly not entry-level in all cases!):

>> http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2567038/
>> http://en.wikipedia.org/wiki/Bayesian_inference_in_phylogeny
>> http://fontana.med.harvard.edu/www/Documents/WF/Papers/theory.pdf

In our work to date, we have tried to think carefully about the philosophical and methodological implications of what we are doing, rather than simply focus on the (admittedly often attractive and very interesting) results – so it is important for us to consider the implications of taking on models from other fields (especially when those other fields are better developed than ours).

Jonathan Hope has done some work on biology and linguistic history in the past. One thing that might make the application of biological models difficult is the different nature of inheritance and ‘traits’ in language and biology. In biology, traits (or the genes that produce them) have to be passed down in a closed, continuous way. We get our genes from our parents, not some random person we bump into on the tube – and it’s impossible for us to naturally acquire a gene, however useful it might be, from another species.

None of this holds for language though. If we’re writing a ‘history’, we’ll want to borrow some traits from other histories – but we don’t have to take everything from other histories, and we can take traits from pretty much any histories we happen to have read: old, recent, famous, unknown. So the status of *generic* traits is very different to *genetic* ones.

In addition, we are not confined to our own linguistic *species*. If we want, we can introduce traits from a completely different species to produce something new. In language, if you want a bat, you can cross rats with sparrows. In biology, you have to wait for one to evolve.

Once we are thinking about genres developing over time, it will be easy for us to assume a biological model of linear generations and influence. It’s a useful way of thinking, and the statistical techniques are powerful, but we’ll need to remember that we aren’t looking at exactly the same kind of process.

A further consideration is the power of the inferencing techniques that have been developed in biology. It is very tempting to want to throw these at our newly available textual data – but one very significant thing to have emerged from our work is the importance of having understandable observations.

If a statistical black box tells you some fact, that is not as interesting or important as being able to understand where a particular thing comes from and how it got there. If some fancy inference algorithm tells you there’s a pattern, it isn’t that helpful unless you can comprehend it, since an incomprehensible or inexplicable pattern might just be an artefact of the process or analysis.

With biology, the models are better known and trusted, so an incomprehensible pattern is easier to accept: it could more safely be taken as an indication of a real effect we just haven’t understood yet.

Ultimately, our interest is in building tools to help people understand the complexity in texts. We are less interested in having machines sort it out automatically (indeed, we are sceptical that this is really possible). That said, there is also a need for tools to help people sort out what the machines figure out…

References

Jonathan Hope, ‘Rats, bats, sparrows and dogs: biology, linguistics and the nature of Standard English’, in The Development of Standard English 1300-1800, Laura Wright (ed.), pp. 49-56 (Cambridge University Press: 2000), ISBN 0-521-77114-5

Stefanie Posavec and Greg McInerny: The (En)tangled Word Bank project (originated at Microsoft Research, Cambridge)

http://research.microsoft.com/en-us/people/a-gregmc/

http://research.microsoft.com/en-us/projects/TextVis/

http://www.itsbeenreal.co.uk/index.php?/on-going/chapter-close-ups/
