
  • Why the Difference? Accounting for Variation between the Folio and Globe Editions of Shakespeare’s Plays

    To what extent is modern text analysis software capable of dealing with historical data? This is a perennial question asked by those working with digitized historical texts who wish to see how an analysis of such texts can be facilitated by cutting-edge technologies. No doubt the best way to answer the question is to test the software on two versions of the same text, where one version is an older and noticeably different rendering of the other.

    Enter the Folio and Globe editions of Shakespeare’s plays. The former was published in 1623 and maintains the original spelling of Shakespeare’s Early Modern English, whereas the latter was published in 1867 and contains modernized spelling throughout. Using DocuScope for text analysis and JMP for statistical visualizations, we created the following dendrogram:

    The texts highlighted in red are from the Folio edition, whereas the texts highlighted in blue come from the Globe edition. One would expect all of Shakespeare’s Folio plays to cluster with their Globe counterparts here. Much Ado About Nothing is Much Ado About Nothing, after all, regardless of which edition it appears in. But for the most part, this neat pairing off is not what happens: instead, most of the Folio plays group with other Folio plays, and the same is true for the Globe plays. Only a few plays actually pair off with their counterparts at the top of the dendrogram. Methinks we have a problem.
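
    For readers without JMP, a rough equivalent of the clustering step can be sketched in Python. Everything in the snippet is a placeholder: the play names, the three category columns, and the counts are invented, and JMP’s exact distance metric and linkage settings may differ.

    ```python
    # Minimal sketch: hierarchical clustering of plays by DocuScope-style
    # category frequencies, drawn as a dendrogram. Placeholder data only.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    texts = ["MuchAdo_Folio", "MuchAdo_Globe", "Tempest_Folio", "Tempest_Globe"]

    # Rows = texts, columns = counts per 1,000 words for a few hypothetical
    # DocuScope categories (e.g. "Interior Thought", "Public Values", ...).
    counts = np.array([
        [41.2, 18.7, 22.1],
        [48.9, 21.3, 25.6],
        [37.5, 16.2, 19.8],
        [44.1, 19.9, 23.4],
    ])

    # Ward linkage on Euclidean distances is one common choice; JMP may use others.
    Z = linkage(counts, method="ward", metric="euclidean")

    dendrogram(Z, labels=texts)
    plt.title("Toy dendrogram: Folio vs. Globe texts (placeholder data)")
    plt.tight_layout()
    plt.show()
    ```

    If the two editions of each play really carried the same tag profile, each Folio/Globe pair would join at a very low height in such a plot; the surprise in the figure above is that, for most plays, they do not.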

    Upon closer inspection, I found that 13,667 items were tagged by DocuScope in the Globe edition of Much Ado, but only 11,382 items were tagged in the Folio edition of the same play: a 16.7% difference. An inspection of eleven other Shakespeare plays yields an overall mean difference of 17.8%: far too large a gap to ignore where tagging accuracy is concerned.
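
    For clarity, the 16.7% figure is the tagging shortfall measured against the larger, Globe count:

    ```python
    # The Folio edition's tagging shortfall, relative to the Globe count.
    globe_tags, folio_tags = 13667, 11382
    print(f"{(globe_tags - folio_tags) / globe_tags:.1%}")  # -> 16.7%
    ```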

    But why the disparity? Maybe a closer look at DocuScope can give us an idea.

    First, the Folio version of the opening scene in Much Ado About Nothing (with the “Interior Thought” and “Public Values” clusters turned on):

    And the Globe version of the same scene with the same clusters turned on:

    One need not read far to discover what’s (not) being tagged in the older, Folio edition of Much Ado: Learne versus learn. It appears the orthographic rendering of the unstressed final –e is causing DocuScope to overlook this word altogether. We find the same mistake later on with indeede/indeed, kindnesse/kindness, helpe/help, and kinde/kind. Another common problem is the Early Modern use of u, which is rendered v in modern orthography: deseru’d vs. deserved, seruice vs. service, and ouerflow vs. overflow. There are also a few punctuation issues causing problems: the use of the apostrophe (as we see in deseru’d) and the use of | (con | flict vs. conflict), which probably results from some sort of scanning or transcription error. In other plays, the hyphen was also found to be a possible culprit in DocuScope overlooking certain items (ouer-charg’d vs. overcharged).

    Although the overall number of DocuScope omissions on a Folio play is rather large, the actual number of error types is quite small. This gives us hope that, with a bit of modification, it may well be possible to train DocuScope to read non-modern(ized) texts.
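
    To make that concrete, here is one possible shape such a modification could take outside DocuScope itself: a small pre-processing pass that normalises the error types listed above before tagging. It is purely illustrative; the spelling map covers only the examples from this post, and serious work would reach for a dedicated tool such as VARD or a period spelling dictionary rather than blunt string rules.

    ```python
    # Illustrative only: a naive normaliser for a few Early Modern spelling
    # patterns (final -e, u-for-v, elided endings, stray pipes and hyphens).
    import re

    # Hand-listed equivalences for the examples in the post; a real mapping
    # would be far larger and ideally corpus-derived.
    SPELLING_MAP = {
        "learne": "learn",
        "indeede": "indeed",
        "kindnesse": "kindness",
        "helpe": "help",
        "kinde": "kind",
        "deseru'd": "deserved",
        "seruice": "service",
        "ouerflow": "overflow",
        "ouer-charg'd": "overcharged",
    }

    def normalise(text: str) -> str:
        # Remove stray pipes introduced by scanning ("con | flict" -> "conflict").
        text = re.sub(r"\s*\|\s*", "", text)
        # Replace hand-listed Early Modern spellings (case handling ignored here).
        for old, new in SPELLING_MAP.items():
            text = re.sub(rf"\b{re.escape(old)}\b", new, text, flags=re.IGNORECASE)
        return text

    print(normalise("I learne that he deseru'd it in | deede"))
    # -> I learn that he deserved it indeed
    ```

    Whether the normalisation happens before tagging, as here, or inside DocuScope’s own dictionaries is an open question; the point is simply that the list of patterns to handle looks manageable.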

     

  • Phylogenetic inference


    Image by Greg McInerny and Stefanie Posavec – textual shifts between editions of Darwin’s Origin of Species (used by kind permission of the artist – see bottom of post for further details).

     

    In advance of starting up some big experiments on the texts being made available by the Text Creation Partnership (TCP), we’ve been discussing the models developed in mathematics and biology for tracing influence.

    This began with a conversation with David Krakauer from the Santa Fe Institute about our work. He works in mathematics and evolutionary biology, and has collaborated a bit with Franco Moretti. We told him about our attempts to group texts by genre and then trace their linguistic predecessors and descendants. He suggested this was similar to the problem of phylogenetic inference in biology.

    The problem as we currently understand it: we identify a group of texts within a population based on manifest traits recognisable to human readers; we then want to account for the development of these traits in texts understood to be earlier in the sequence; we link the traits to sentence-level linguistic items; and we track the traits via these items.

    We are thinking about this, since this is going to be one of our big intellectual problems once we add time to the analysis (so far, we’ve been looking at populations, e.g. Shakespeare’s plays, in pretty much the same time slot).
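
    To make the analogy concrete, a toy example of the kind of distance-based tree building the phylogenetics literature starts from (UPGMA, i.e. average-linkage clustering over a distance matrix) might look like the sketch below. Everything in it is invented: the text labels, the pairwise distances, and the assumption that simple feature distances would suffice; the Bayesian methods in the references that follow are considerably more sophisticated.

    ```python
    # Toy distance-based tree building (UPGMA / average linkage), as a very
    # rough stand-in for phylogenetic inference over texts. All values invented.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, to_tree
    from scipy.spatial.distance import squareform

    texts = ["history_A_1590", "history_B_1595", "history_C_1605", "comedy_D_1600"]

    # Symmetric matrix of pairwise 'linguistic distances' between texts
    # (in practice, distances between vectors of sentence-level feature counts).
    D = np.array([
        [0.00, 0.10, 0.25, 0.60],
        [0.10, 0.00, 0.20, 0.55],
        [0.25, 0.20, 0.00, 0.50],
        [0.60, 0.55, 0.50, 0.00],
    ])

    # UPGMA is average linkage applied to a precomputed distance matrix.
    Z = linkage(squareform(D), method="average")

    def newick(node, labels):
        """Render a scipy cluster tree in Newick format, ignoring branch lengths."""
        if node.is_leaf():
            return labels[node.id]
        return f"({newick(node.left, labels)},{newick(node.right, labels)})"

    print(newick(to_tree(Z), texts) + ";")
    # e.g. (((history_A_1590,history_B_1595),history_C_1605),comedy_D_1600);
    ```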

    Here are some starter references (though they are certainly not entry-level in all cases!):

    >> http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2567038/
    >> http://en.wikipedia.org/wiki/Bayesian_inference_in_phylogeny
    >> http://fontana.med.harvard.edu/www/Documents/WF/Papers/theory.pdf

    In our work to date, we have tried to think carefully about the philosophical and methodological implications of what we are doing, rather than simply focus on the (admittedly often attractive and very interesting) results – so it is important for us to consider the implications of taking on models from other fields (especially when those other fields are better developed than ours).

    Jonathan Hope has done some work on biology and linguistic history in the past. And one thing that might make the application of biological models difficult is the different natures of inheritance and ‘traits’ in language and biology. In biology, traits (or the genes that produce them) have to be passed down in a closed, continuous way.  We get our genes from our parents, not some random person we bump into on the tube – and it’s impossible for us to naturally acquire a gene, however useful it might be, from another species.

    None of this holds for language though. If we’re writing a ‘history’, we’ll want to borrow some traits from other histories – but we don’t have to take everything from other histories, and we can take traits from pretty much any histories we happen to have read: old, recent, famous, unknown. So the status of *generic* traits is very different to *genetic* ones.

    In addition, we are not confined to our own linguistic *species*. If we want, we can introduce traits from a completely different species to produce something new. In language, if you want a bat, you can cross rats with sparrows. In biology, you have to wait for one to evolve.

    Once we are thinking about genres developing over time, it will be easy for us to assume a biological model of linear generations and influence. It’s a useful way of thinking, and the statistical techniques are powerful, but we’ll need to remember that we aren’t looking at exactly the same kind of process.

    A further consideration is the power of the inferencing techniques that have been developed in biology. It is very tempting to want to throw these at our newly available textual data – but one very significant thing to have emerged from our work is the importance of having understandable observations.

    If a statistical black box tells you some fact, that is not as interesting or important as being able to understand where a particular thing comes from and how it got there. If some fancy inference algorithm tells you there’s a pattern, it isn’t that helpful unless you can comprehend it, since an incomprehensible or inexplicable pattern might just be an artefact of the process or analysis.

    With biology, the models are better known and trusted, so an incomprehensible pattern is easier to accept: it could more safely be taken as an indication of a real effect we just haven’t understood yet.

    Ultimately, our interest is in building tools to help people understand the complexity in texts. We are less interested in having machines sort it out automatically (indeed, we are probably sceptical that this is really possible). That said, there is also a need for tools to help people sort out what the machines figure out…

     

    References

    Jonathan Hope, ‘Rats, bats, sparrows and dogs: biology, linguistics and the nature of Standard English’, in Laura Wright (ed.), The Development of Standard English 1300–1800 (Cambridge University Press, 2000), pp. 49–56. ISBN 0-521-77114-5.

    Stefanie Posavec and Greg McInerny:  The (En)tangled Word Bank project (originated at Microsoft Research, Cambridge)

    http://research.microsoft.com/en-us/people/a-gregmc/

    http://research.microsoft.com/en-us/projects/TextVis/

    http://www.itsbeenreal.co.uk/index.php?/on-going/chapter-close-ups/